feat: v2 data engine live upgrade #3282

Open
wants to merge 16 commits into base: master

Conversation

@derekbit (Member) commented Nov 22, 2024

Signed-off-by: Derek Su [email protected]

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9104

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

@derekbit self-assigned this Nov 22, 2024

coderabbitai bot commented Nov 22, 2024

Walkthrough

This pull request touches multiple Longhorn controllers and components to support v2 data engine live upgrades and the related volume management. Key modifications include new controllers and methods for managing upgrades, refined error handling, and improved logging. The new resource types DataEngineUpgradeManager and NodeDataEngineUpgrade, together with their validators and mutators, provide a structured way to drive the upgrade process.

Changes

| File Path | Change Summary |
| --- | --- |
| controller/backup_controller.go | Updated isResponsibleFor method logic; refined error handling in handleBackupDeletionInBackupStore. |
| controller/controller_manager.go | Added dataEngineUpgradeManagerController and nodeDataEngineUpgradeController to StartControllers. |
| controller/engine_controller.go | Modified syncEngine, added various instance management methods, and improved error handling. |
| controller/instance_handler.go | Updated InstanceManagerHandler interface, refactored syncStatusWithInstanceManager, and enhanced state management. |
| controller/instance_handler_test.go | Updated mock methods and added new methods for instance management. |
| controller/monitor/node_upgrade_monitor.go | Introduced NodeDataEngineUpgradeMonitor for monitoring upgrade processes. |
| controller/monitor/upgrade_manager_monitor.go | Added DataEngineUpgradeManagerMonitor for managing upgrade manager status. |
| controller/node_controller.go | Enhanced node condition checks and logging for eviction requests. |
| controller/node_upgrade_controller.go | Introduced NodeDataEngineUpgradeController for managing upgrade resources. |
| controller/replica_controller.go | Updated CreateInstance and added methods for managing instance states. |
| controller/uninstall_controller.go | Added methods for deleting upgrade managers and node upgrades. |
| controller/upgrade_manager_controller.go | Introduced DataEngineUpgradeManagerController for managing upgrade manager resources. |
| controller/utils.go | Removed isVolumeUpgrading function. |
| controller/volume_controller.go | Enhanced volume upgrade handling and replica management logic. |
| datastore/datastore.go | Added fields for managing data engine upgrades. |
| datastore/longhorn.go | Introduced methods for managing DataEngineUpgradeManager and NodeDataEngineUpgrade. |
| k8s/crds.yaml | Updated CRDs with new descriptions and added new CRDs for data engine upgrades. |
| k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go | Introduced DataEngineUpgradeManager resource structure. |
| k8s/pkg/apis/longhorn/v1beta2/instancemanager.go | Added new fields and states for instance management. |
| k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go | Introduced NodeDataEngineUpgrade resource structure. |
| k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go | Added client interface methods for managing DataEngineUpgradeManager. |
| k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go | Added client interface methods for managing NodeDataEngineUpgrade. |
| webhook/resources/dataengineupgrademanager/mutator.go | Introduced mutator for DataEngineUpgradeManager creation. |
| webhook/resources/dataengineupgrademanager/validator.go | Introduced validator for DataEngineUpgradeManager creation and updates. |
| webhook/resources/nodedataengineupgrade/mutator.go | Introduced mutator for NodeDataEngineUpgrade creation. |
| webhook/resources/nodedataengineupgrade/validator.go | Introduced validator for NodeDataEngineUpgrade creation and updates. |

Assessment against linked issues

Objective Addressed Explanation
Support live upgrade for control plane (Issue #9104)

Possibly related PRs

Suggested reviewers

  • shuo-wu


mergify bot commented Nov 23, 2024

This pull request is now in conflict. Could you fix it @derekbit? 🙏

@derekbit marked this pull request as ready for review November 25, 2024 12:14
@coderabbitai bot left a comment

Actionable comments posted: 39

🧹 Outside diff range and nitpick comments (67)
webhook/resources/dataengineupgrademanager/mutator.go (2)

49-51: Enhance type assertion error message.

Include the actual type in the error message for better debugging.

-		return nil, werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.DataEngineUpgradeManager", newObj), "")
+		return nil, werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.DataEngineUpgradeManager (got %T)", newObj, newObj), "")

54-62: Improve error context for label operations.

The error handling could be more specific about what failed during the label operation.

-		err := errors.Wrapf(err, "failed to get label patch for upgradeManager %v", upgradeManager.Name)
+		err := errors.Wrapf(err, "failed to get label patch for upgradeManager %v: labels=%v", upgradeManager.Name, longhornLabels)
webhook/resources/nodedataengineupgrade/mutator.go (3)

21-28: Add nil check for datastore parameter

Consider adding validation for the datastore parameter to prevent potential nil pointer dereferences.

 func NewMutator(ds *datastore.DataStore) admission.Mutator {
+	if ds == nil {
+		panic("nil datastore")
+	}
 	return &nodeDataEngineUpgradeMutator{ds: ds}
 }

43-45: Consider removing unused datastore field

The ds field in the struct is currently unused. If it's not needed for future operations, consider removing it.


47-74: Enhance error handling and maintainability

Consider the following improvements:

  1. Use more specific error messages in type assertion
  2. Consider extracting operation names into constants
  3. Be consistent with error wrapping (some use errors.Wrapf, others use string formatting)
-		return nil, werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.NodeDataEngineUpgrade", newObj), "")
+		return nil, werror.NewInvalidError(fmt.Sprintf("expected *longhorn.NodeDataEngineUpgrade but got %T", newObj), "")

-		err := errors.Wrapf(err, "failed to get label patch for nodeUpgrade %v", nodeUpgrade.Name)
+		err = errors.Wrapf(err, "failed to get label patch for nodeUpgrade %v", nodeUpgrade.Name)
k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (1)

21-35: Consider adding validation for InstanceManagerImage

The status structure is well-designed for tracking upgrades across nodes. However, consider adding validation for the InstanceManagerImage field to ensure it follows the expected format (e.g., valid container image reference).

engineapi/instance_manager_test.go (2)

5-71: Consider adding more test cases for better coverage.

While the existing test cases cover basic scenarios, consider adding these additional cases to improve coverage:

  1. Empty replica addresses map
  2. Nil replica addresses map
  3. Case where initiator IP matches target IP
  4. Edge cases for port numbers (0, 65535, invalid ports)

Example additional test cases:

 tests := []struct {
     // ... existing fields
 }{
     // ... existing test cases
+    {
+        name:             "Empty replica addresses",
+        replicaAddresses: map[string]string{},
+        initiatorAddress: "192.168.1.3:9502",
+        targetAddress:    "192.168.1.3:9502",
+        expected:         map[string]string{},
+        expectError:      false,
+    },
+    {
+        name:             "Nil replica addresses",
+        replicaAddresses: nil,
+        initiatorAddress: "192.168.1.3:9502",
+        targetAddress:    "192.168.1.3:9502",
+        expected:         nil,
+        expectError:      true,
+    },
+    {
+        name: "Invalid port number",
+        replicaAddresses: map[string]string{
+            "replica1": "192.168.1.1:65536",
+        },
+        initiatorAddress: "192.168.1.3:9502",
+        targetAddress:    "192.168.1.3:9502",
+        expected:         nil,
+        expectError:      true,
+    },
 }

73-84: Enhance error messages and add documentation.

While the test execution is correct, consider these improvements:

  1. Make error messages more descriptive by including the test case name
  2. Document that no cleanup is required

Apply this diff to improve the error messages:

 t.Run(tt.name, func(t *testing.T) {
     result, err := getReplicaAddresses(tt.replicaAddresses, tt.initiatorAddress, tt.targetAddress)
     if (err != nil) != tt.expectError {
-        t.Errorf("expected error: %v, got: %v", tt.expectError, err)
+        t.Errorf("%s: expected error: %v, got: %v", tt.name, tt.expectError, err)
     }
     if !tt.expectError && !equalMaps(result, tt.expected) {
-        t.Errorf("expected: %v, got: %v", tt.expected, result)
+        t.Errorf("%s: expected addresses: %v, got: %v", tt.name, tt.expected, result)
     }
 })
k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go (4)

5-21: Add documentation for upgrade states and workflow.

The UpgradeState type defines a comprehensive set of states, but lacks documentation explaining:

  • The purpose and conditions for each state
  • The expected state transitions/workflow
  • Any timeout or error handling considerations

This documentation is crucial for maintainers and users implementing the upgrade logic.

Add documentation like this:

 type UpgradeState string
 
 const (
+    // UpgradeStateUndefined indicates the upgrade state hasn't been set
     UpgradeStateUndefined                = UpgradeState("")
+    // UpgradeStatePending indicates the upgrade is queued but not started
     UpgradeStatePending                  = UpgradeState("pending")
     // ... document remaining states ...
 )

35-40: Consider adding fields for better observability.

The current status structure could be enhanced with additional fields useful for troubleshooting:

  • LastTransitionTime: When the current state was entered
  • Conditions: Array of conditions following Kubernetes patterns
  • Progress: Numerical progress indicator

Example enhancement:

 type VolumeUpgradeStatus struct {
     // +optional
     State UpgradeState `json:"state"`
     // +optional
     Message string `json:"message"`
+    // +optional
+    LastTransitionTime *metav1.Time `json:"lastTransitionTime,omitempty"`
+    // +optional
+    Progress int32 `json:"progress,omitempty"`
 }

42-52: Add standard Kubernetes status fields.

Consider adding standard Kubernetes status fields for better integration:

  • Conditions array following Kubernetes patterns
  • ObservedGeneration for tracking spec changes

Example enhancement:

 type NodeDataEngineUpgradeStatus struct {
+    // +optional
+    Conditions []metav1.Condition `json:"conditions,omitempty"`
+    // +optional
+    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
     // ... existing fields ...
 }

54-69: Add helpful printer columns for observability.

Consider adding more printer columns for better operational visibility:

  • Age: Standard column for resource age
  • Message: Latest status message

Add these printer columns:

 // +kubebuilder:printcolumn:name="State",type=string,JSONPath=`.status.state`,description="The current state of the node upgrade process"
+// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
+// +kubebuilder:printcolumn:name="Message",type="string",JSONPath=".status.message"
k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (2)

47-53: Consider adding type assertion safety checks

The type assertion m.(*v1beta2.DataEngineUpgradeManager) could panic if the indexer contains an object of the wrong type. Consider adding a type check:

 func (s *dataEngineUpgradeManagerLister) List(selector labels.Selector) (ret []*v1beta2.DataEngineUpgradeManager, err error) {
 	err = cache.ListAll(s.indexer, selector, func(m interface{}) {
-		ret = append(ret, m.(*v1beta2.DataEngineUpgradeManager))
+		if obj, ok := m.(*v1beta2.DataEngineUpgradeManager); ok {
+			ret = append(ret, obj)
+		}
 	})
 	return ret, err
 }

76-82: Consider adding type assertion safety checks in namespace List method

Similar to the main List method, the namespace-specific List method could benefit from safer type assertions:

 func (s dataEngineUpgradeManagerNamespaceLister) List(selector labels.Selector) (ret []*v1beta2.DataEngineUpgradeManager, err error) {
 	err = cache.ListAllByNamespace(s.indexer, s.namespace, selector, func(m interface{}) {
-		ret = append(ret, m.(*v1beta2.DataEngineUpgradeManager))
+		if obj, ok := m.(*v1beta2.DataEngineUpgradeManager); ok {
+			ret = append(ret, obj)
+		}
 	})
 	return ret, err
 }
k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go (1)

35-40: LGTM: Well-defined interface for upgrade monitoring

The NodeDataEngineUpgradeInformer interface provides a clean separation between the informer and lister functionalities, which is essential for implementing the live upgrade feature mentioned in the PR objectives.

This interface will be crucial for the following (a usage sketch follows the list):

  • Watching upgrade status changes in real-time
  • Maintaining consistency during the upgrade process
  • Enabling rollback capabilities if needed
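
For illustration, a hedged usage sketch of the generated informer follows; the package name, function name, factory wiring, handler body, and log message are assumptions for illustration, not code from this PR:

// Hedged sketch (not from this PR): observe NodeDataEngineUpgrade status transitions
// through the generated shared informer factory.
package example

import (
	"github.com/sirupsen/logrus"
	"k8s.io/client-go/tools/cache"

	longhorn "github.com/longhorn/longhorn-manager/k8s/pkg/apis/longhorn/v1beta2"
	lhinformers "github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions"
)

func watchNodeDataEngineUpgrades(factory lhinformers.SharedInformerFactory) {
	informer := factory.Longhorn().V1beta2().NodeDataEngineUpgrades().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			upgrade, ok := newObj.(*longhorn.NodeDataEngineUpgrade)
			if !ok {
				return
			}
			logrus.Infof("NodeDataEngineUpgrade %v is now in state %v", upgrade.Name, upgrade.Status.State)
		},
	})
}
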
k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go (2)

63-73: Consider handling potential errors from ExtractFromListOptions.

The label extraction ignores potential errors from ExtractFromListOptions. While this is common in fake implementations, consider handling these errors for more robust testing scenarios.

-	label, _, _ := testing.ExtractFromListOptions(opts)
+	label, _, err := testing.ExtractFromListOptions(opts)
+	if err != nil {
+		return nil, err
+	}

105-107: Enhance UpdateStatus method documentation.

The comment about +genclient:noStatus could be more descriptive. Consider clarifying that this is a generated method for handling status updates of DataEngineUpgradeManager resources, which is crucial for tracking upgrade progress.

-// UpdateStatus was generated because the type contains a Status member.
-// Add a +genclient:noStatus comment above the type to avoid generating UpdateStatus().
+// UpdateStatus updates the Status subresource of DataEngineUpgradeManager.
+// This method is auto-generated due to the status field in the resource spec.
+// To disable generation, add +genclient:noStatus to the type definition.
k8s/pkg/apis/longhorn/v1beta2/node.go (1)

149-151: Enhance field documentation while implementation looks good.

The field implementation is correct and follows Kubernetes API conventions. Consider enhancing the documentation to provide more context:

-	// Request to upgrade the instance manager for v2 volumes on the node.
+	// Request to upgrade the instance manager for v2 volumes on the node.
+	// When set to true, the node controller will initiate the data engine upgrade process.
+	// This field should be set to false once the upgrade is complete.
k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (4)

95-96: Add documentation and validation for TargetNodeID

The new TargetNodeID field lacks:

  1. Documentation comments explaining its purpose and usage
  2. Validation rules to ensure valid node IDs are provided

Consider adding:

  • A comment explaining when and how this field is used during upgrades
  • Validation using kubebuilder tags (e.g., for length or format), as in the sketch below
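
A hedged sketch of what this could look like; the field name comes from the review, but the doc comment and the validation marker below are illustrative assumptions rather than the PR's actual code:

// TargetNodeID requests that the instance be relocated to this node during a
// v2 data engine live upgrade; an empty value means no target is set.
// (Hedged sketch: comment text and marker are assumptions, not from this PR.)
// +kubebuilder:validation:MaxLength=253
// +optional
TargetNodeID string `json:"targetNodeID"`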

115-119: Add documentation for target-related network fields

Please add documentation comments for the new network-related fields:

  • TargetIP
  • StorageTargetIP
  • TargetPort

These fields seem crucial for upgrade coordination and their purpose should be clearly documented.


129-132: Consider using an enum for instance replacement state

The boolean TargetInstanceReplacementCreated suggests a binary state. Consider using an enum instead to allow for more states in the future (e.g., "pending", "in_progress", "completed", "failed").
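
A hedged sketch of such an enum; the type and constant names below are assumptions for illustration, not part of this PR:

// Hypothetical replacement-state type, as an alternative to the boolean
// TargetInstanceReplacementCreated field.
type TargetInstanceReplacementState string

const (
	TargetInstanceReplacementStatePending    TargetInstanceReplacementState = "pending"
	TargetInstanceReplacementStateInProgress TargetInstanceReplacementState = "in-progress"
	TargetInstanceReplacementStateCompleted  TargetInstanceReplacementState = "completed"
	TargetInstanceReplacementStateFailed     TargetInstanceReplacementState = "failed"
)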


154-156: Clarify the purpose and relationship of standby ports

The new standby port fields need:

  1. Documentation comments explaining their purpose
  2. Clarification on how they relate to the existing TargetPortStart/End
  3. Architectural explanation of the standby mechanism in the upgrade process
controller/controller_manager.go (1)

226-227: LGTM! Consider adding comments for better documentation.

The controller execution follows the established pattern, using the same worker count and shutdown channel as other controllers.

Consider adding comments to document these new controllers, similar to the "Start goroutines for Longhorn controllers" comment above:

 	go volumeCloneController.Run(Workers, stopCh)
 	go volumeExpansionController.Run(Workers, stopCh)
+	// Start goroutines for data engine upgrade controllers
 	go dataEngineUpgradeManagerController.Run(Workers, stopCh)
 	go nodeDataEngineUpgradeController.Run(Workers, stopCh)
k8s/pkg/apis/longhorn/v1beta2/volume.go (1)

305-306: Consider documenting the node targeting lifecycle

The addition of these fields introduces another dimension to node targeting in Longhorn. To ensure maintainability and prevent confusion:

  1. Consider adding a comment block in the Volume type documentation explaining the relationship and lifecycle of all node-related fields:
    • NodeID vs TargetNodeID
    • CurrentNodeID vs CurrentTargetNodeID
    • MigrationNodeID vs CurrentMigrationNodeID
  2. Document the state transitions during the upgrade process
  3. Consider adding validation rules to prevent conflicting node assignments

Would you like me to help draft the documentation for the node targeting lifecycle?

Also applies to: 358-359

datastore/datastore.go (1)

48-96: Consider grouping related fields together

While the implementation is correct, consider grouping the data engine upgrade related fields with other engine-related fields for better code organization. This would improve the readability and maintainability of the code.

Consider reordering the fields to group them with other engine-related fields:

 	engineLister                     lhlisters.EngineLister
 	EngineInformer                   cache.SharedInformer
+	dataEngineUpgradeManagerLister   lhlisters.DataEngineUpgradeManagerLister
+	DataEngineUpgradeManagerInformer cache.SharedInformer
+	nodeDataEngineUpgradeLister      lhlisters.NodeDataEngineUpgradeLister
+	NodeDataEngineUpgradeInformer    cache.SharedInformer
 	replicaLister                    lhlisters.ReplicaLister
 	ReplicaInformer                  cache.SharedInformer
-	dataEngineUpgradeManagerLister   lhlisters.DataEngineUpgradeManagerLister
-	DataEngineUpgradeManagerInformer cache.SharedInformer
-	nodeDataEngineUpgradeLister      lhlisters.NodeDataEngineUpgradeLister
-	NodeDataEngineUpgradeInformer    cache.SharedInformer
types/types.go (1)

1271-1291: Consider adding validation for empty parameters.

The functions look good and follow the established patterns. However, consider adding validation for empty prefix and nodeID parameters in GenerateNodeDataEngineUpgradeName to prevent potential issues.

Apply this diff to add parameter validation:

 func GenerateNodeDataEngineUpgradeName(prefix, nodeID string) string {
+	if prefix == "" || nodeID == "" {
+		return ""
+	}
 	return prefix + "-" + nodeID + "-" + util.RandomID()
 }
webhook/resources/dataengineupgrademanager/validator.go (2)

70-72: Handle order-independent comparison for Nodes field

Using reflect.DeepEqual to compare the Nodes field may lead to false negatives if the order of nodes differs, even when they contain the same elements. Since the order of nodes is likely insignificant, consider sorting the slices before comparison to ensure an order-independent check.

Apply this diff to adjust the comparison:

import (
    // Existing imports
+   "sort"
)

func (u *dataEngineUpgradeManagerValidator) Update(request *admission.Request, oldObj runtime.Object, newObj runtime.Object) error {
    // Existing code
+   // Sort the Nodes slices before comparison
+   oldNodes := append([]string{}, oldUpgradeManager.Spec.Nodes...)
+   newNodes := append([]string{}, newUpgradeManager.Spec.Nodes...)
+   sort.Strings(oldNodes)
+   sort.Strings(newNodes)
-   if !reflect.DeepEqual(oldUpgradeManager.Spec.Nodes, newUpgradeManager.Spec.Nodes) {
+   if !reflect.DeepEqual(oldNodes, newNodes) {
        return werror.NewInvalidError("spec.nodes field is immutable", "spec.nodes")
    }
    // Existing code
}

44-44: Improve error messages by including the actual type received

In the type assertion error messages, including the actual type of the object received can aid in debugging. Modify the error messages to reflect the type.

Apply this diff to enhance the error messages:

- return werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.DataEngineUpgradeManager", newObj), "")
+ return werror.NewInvalidError(fmt.Sprintf("%T is not a *longhorn.DataEngineUpgradeManager", newObj), "")

Similarly, update lines 58 and 62:

- return werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.DataEngineUpgradeManager", oldObj), "")
+ return werror.NewInvalidError(fmt.Sprintf("%T is not a *longhorn.DataEngineUpgradeManager", oldObj), "")
- return werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.DataEngineUpgradeManager", newObj), "")
+ return werror.NewInvalidError(fmt.Sprintf("%T is not a *longhorn.DataEngineUpgradeManager", newObj), "")

Also applies to: 58-58, 62-62

controller/node_upgrade_controller.go (3)

259-268: Ensure proper deep copy of status volumes

When copying the Volumes map from the monitor status to nodeUpgrade.Status.Volumes, a shallow copy can lead to unintended side effects.

Apply this diff to perform a deep copy of each VolumeUpgradeStatus:

nodeUpgrade.Status.State = status.State
nodeUpgrade.Status.Message = status.Message
nodeUpgrade.Status.Volumes = make(map[string]*longhorn.VolumeUpgradeStatus)
for k, v := range status.Volumes {
-   nodeUpgrade.Status.Volumes[k] = &longhorn.VolumeUpgradeStatus{
-       State:   v.State,
-       Message: v.Message,
-   }
+   nodeUpgrade.Status.Volumes[k] = v.DeepCopy()
}

This ensures that each VolumeUpgradeStatus is independently copied.


128-145: Correct the use of maxRetries in handleErr

Assuming maxRetries is defined within the controller struct, it should be referenced using uc.maxRetries.

Apply this diff to correctly reference maxRetries:

-   if uc.queue.NumRequeues(key) < maxRetries {
+   if uc.queue.NumRequeues(key) < uc.maxRetries {
        handleReconcileErrorLogging(log, err, "Failed to sync Longhorn nodeDataEngineUpgrade resource")
        uc.queue.AddRateLimited(key)
        return
    }

Ensure that maxRetries is properly defined as a field in the controller.


86-94: Add logging to enqueueNodeDataEngineUpgrade for better traceability

Including logging when enqueueing items can help in debugging and monitoring the controller's workflow.

Apply this diff to add debug logging:

uc.queue.Add(key)
+ uc.logger.WithField("key", key).Debug("Enqueued NodeDataEngineUpgrade for processing")

This provides visibility into when items are added to the queue.

controller/upgrade_manager_controller.go (5)

55-58: Address the TODO comment regarding the event recorder wrapper

There is a TODO comment indicating that the wrapper should be removed once all clients have moved to use the clientset. Consider addressing this to clean up the codebase if appropriate.

Do you want me to help remove the wrapper and update the code accordingly?


204-206: Add nil check before closing the monitor

In reconcile, there is a potential nil pointer dereference if uc.dataEngineUpgradeManagerMonitor is already nil.

Add a nil check to ensure safety:

 if uc.dataEngineUpgradeManagerMonitor != nil {
     uc.dataEngineUpgradeManagerMonitor.Close()
     uc.dataEngineUpgradeManagerMonitor = nil
 }

217-227: Optimize status update comparison

Using reflect.DeepEqual to compare the entire status can be inefficient. This may impact performance, especially with large structs.

Consider comparing specific fields that are expected to change or use a hash to detect changes more efficiently.
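
As an illustration, a field-by-field comparison could look like the hedged sketch below; the status type and field names are taken from this review where possible and are otherwise assumptions, so treat it as a sketch rather than the actual struct:

// Hedged sketch: compare only the fields the monitor actually mutates instead of
// calling reflect.DeepEqual on the whole status.
func upgradeManagerStatusChanged(existing, updated *longhorn.DataEngineUpgradeManagerStatus) bool {
	return existing.State != updated.State ||
		existing.Message != updated.Message ||
		existing.InstanceManagerImage != updated.InstanceManagerImage ||
		existing.UpgradingNode != updated.UpgradingNode
}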


175-181: Simplify error handling in reconcile

The error handling logic can be streamlined for better readability.

Refactor the error handling as follows:

upgradeManager, err := uc.ds.GetDataEngineUpgradeManager(upgradeManagerName)
if err != nil {
    if apierrors.IsNotFound(err) {
        return nil
    }
    return err
}

48-53: Consistent parameter ordering in constructor

In NewDataEngineUpgradeManagerController, the parameters controllerID and namespace are passed at the end. For consistency with other controllers, consider placing these parameters earlier in the argument list.

Reorder the parameters for consistency:

 func NewDataEngineUpgradeManagerController(
     logger logrus.FieldLogger,
     ds *datastore.DataStore,
     scheme *runtime.Scheme,
     controllerID string,
     namespace string,
     kubeClient clientset.Interface) (*DataEngineUpgradeManagerController, error) {
controller/monitor/upgrade_manager_monitor.go (3)

26-27: Typographical error in constant naming

The constant DataEngineUpgradeMonitorMonitorSyncPeriod has an extra "Monitor" in its name. Consider renaming it to DataEngineUpgradeManagerMonitorSyncPeriod for clarity and consistency.

Apply this diff to fix the naming:

-	DataEngineUpgradeMonitorMonitorSyncPeriod = 5 * time.Second
+	DataEngineUpgradeManagerMonitorSyncPeriod = 5 * time.Second

41-41: Use conventional naming for cancel function

The variable quit is used for the cancel function returned by context.WithCancel, but it's more conventional to name it cancel for clarity.

Apply this diff to rename the variable:

-	ctx, quit := context.WithCancel(context.Background())
+	ctx, cancel := context.WithCancel(context.Background())

Also, update all references of quit to cancel in the monitor:

-	m.quit()
+	m.cancel()

331-331: Unresolved TODO: Check for untracked node data engine upgrades

There is a TODO comment indicating that the code should check if there are any NodeDataEngineUpgrade resources in progress but not tracked by m.upgradeManagerStatus.UpgradingNode. Addressing this is important to ensure that all in-progress upgrades are correctly monitored.

Would you like assistance in implementing this logic?

webhook/resources/volume/mutator.go (2)

49-61: Add unit tests for areAllDefaultInstanceManagersStopped

To ensure the correctness and reliability of the areAllDefaultInstanceManagersStopped function, consider adding unit tests that cover various scenarios, such as:

  • All default instance managers are stopped.
  • Some default instance managers are not stopped.
  • Error handling when listing instance managers fails.

This will help prevent regressions and improve maintainability.


63-86: Add unit tests for getActiveInstanceManagerImage

Adding unit tests for getActiveInstanceManagerImage will help validate its behavior in different scenarios:

  • When all default instance managers are stopped, and there is at least one non-default instance manager.
  • When all default instance managers are stopped, but there are no non-default instance managers.
  • When not all default instance managers are stopped.

This will improve code reliability and ease future maintenance.

webhook/resources/volume/validator.go (6)

104-104: Clarify the error message for empty engine image

The error message "BUG: Invalid empty Setting.EngineImage" may confuse users. Consider removing "BUG:" to make it clearer, e.g., "Invalid empty Setting.EngineImage".


131-133: Ensure consistent error handling by wrapping errors

The error returned at line 133 is not wrapped with werror.NewInvalidError. For consistency with other error returns in the code, consider wrapping the error:

return werror.NewInvalidError(err.Error(), "")

144-145: Verify compatibility check function name

Ensure that the function CheckDataEngineImageCompatiblityByImage is correctly named. The word "Compatiblity" appears to be misspelled; it should be "Compatibility".


165-177: Redundant condition check for volume.Spec.NodeID

Within the if volume.Spec.NodeID != "" block starting at line 165, there is another check for volume.Spec.NodeID != "" at line 173. This check is redundant and can be removed.


298-347: Refactor repeated validation checks into a helper function

Multiple blocks between lines 298-347 perform similar validation checks for different volume specifications when using data engine v2. To improve maintainability and reduce code duplication, consider refactoring these checks into a helper function.

For example, create a function:

func (v *volumeValidator) validateImmutableFieldsForDataEngineV2(oldVolume, newVolume *longhorn.Volume, fieldName string, oldValue, newValue interface{}) error {
    if !reflect.DeepEqual(oldValue, newValue) {
        err := fmt.Errorf("changing %s for volume %v is not supported for data engine %v", fieldName, newVolume.Name, newVolume.Spec.DataEngine)
        return werror.NewInvalidError(err.Error(), "")
    }
    return nil
}

And then use it for each field:

if err := v.validateImmutableFieldsForDataEngineV2(oldVolume, newVolume, "backing image", oldVolume.Spec.BackingImage, newVolume.Spec.BackingImage); err != nil {
    return err
}

408-409: Typographical error in error message

In the error message at line 409, "unable to set targetNodeID for volume when the volume is not using data engine v2", the field should be formatted consistently. Consider quoting spec.targetNodeID for clarity.

- "unable to set targetNodeID for volume when the volume is not using data engine v2"
+ "unable to set spec.targetNodeID for volume when the volume is not using data engine v2"
controller/replica_controller.go (1)

611-613: Implement the placeholder methods or clarify their future use

The methods SuspendInstance, ResumeInstance, SwitchOverTarget, DeleteTarget, and RequireRemoteTargetInstance currently return default values without any implementation. If these methods are intended for future functionality, consider adding appropriate implementations or TODO comments to indicate pending work.

Would you like assistance in implementing these methods or creating a GitHub issue to track their development?

Also applies to: 615-617, 619-621, 623-625, 627-629

controller/monitor/node_upgrade_monitor.go (1)

220-222: Consistent error variable naming for clarity

The variable errList used here refers to a single error instance. Using errList may imply it contains multiple errors, which can be misleading.

Consider renaming errList to err for consistency:

- replicas, errList := m.ds.ListReplicasByNodeRO(nodeUpgrade.Status.OwnerID)
- if errList != nil {
-     err = errors.Wrapf(errList, "failed to list replicas on node %v", nodeUpgrade.Status.OwnerID)
+ replicas, err := m.ds.ListReplicasByNodeRO(nodeUpgrade.Status.OwnerID)
+ if err != nil {
+     err = errors.Wrapf(err, "failed to list replicas on node %v", nodeUpgrade.Status.OwnerID)
engineapi/instance_manager.go (1)

532-555: Improve error messages by including invalid addresses

The error messages in the getReplicaAddresses function can be more informative by including the invalid address that caused the error. This will aid in debugging and provide better context.

Apply the following diff to enhance the error messages:

- return nil, errors.New("invalid initiator address format")
+ return nil, fmt.Errorf("invalid initiator address format: %s", initiatorAddress)

- return nil, errors.New("invalid target address format")
+ return nil, fmt.Errorf("invalid target address format: %s", targetAddress)

- return nil, errors.New("invalid replica address format")
+ return nil, fmt.Errorf("invalid replica address format: %s", addr)
controller/backup_controller.go (1)

599-602: Handle Node NotFound error explicitly

In the isResponsibleFor method, if the node resource is not found, the current code returns false, err. A missing node could signify that the node has been removed from the cluster, and the controller should treat it as not responsible without raising an error.

Consider handling the NotFound error explicitly:

 node, err := bc.ds.GetNodeRO(bc.controllerID)
 if err != nil {
+  if apierrors.IsNotFound(err) {
+    return false, nil
+  }
   return false, err
 }
controller/instance_handler.go (1)

214-245: Simplify nested conditional logic for better readability

The nested conditional statements between lines 214-245 are complex and may reduce readability. Consider refactoring the code to simplify the logic, which will enhance maintainability and make future modifications easier.

scheduler/replica_scheduler.go (4)

Line range hint 480-480: Address the TODO comment regarding V2 rebuilding

The TODO comment on line 480 indicates that the code handling for reusing failed replicas during V2 rebuilding is temporary. To ensure clarity and proper tracking, please consider creating an issue or task to remove or update this code once failed replica reuse is supported in V2.

Would you like assistance in creating a GitHub issue to track this TODO?


Line range hint 439-439: Remove or resolve the 'Investigate' comment

The comment // Investigate suggests that further attention is needed for the getDiskWithMostUsableStorage function. Leaving such comments can cause confusion. Please either address the underlying issue or remove the comment.

Would you like assistance in reviewing this function to address any concerns?


Line range hint 440-444: Simplify the initialization of diskWithMostUsableStorage

The variable diskWithMostUsableStorage is initialized with an empty Disk struct and then immediately reassigned in the loop. This is unnecessary and could be simplified. Consider initializing it directly from the first element in the disks map.

Apply this diff to simplify the initialization:

 func (rcs *ReplicaScheduler) getDiskWithMostUsableStorage(disks map[string]*Disk) *Disk {
-    diskWithMostUsableStorage := &Disk{}
+    var diskWithMostUsableStorage *Disk
     for _, disk := range disks {
         diskWithMostUsableStorage = disk
         break
     }

Line range hint 88-88: Avoid reusing variable names like multiError to prevent confusion

The variable multiError is declared multiple times within the FindDiskCandidates function, which can lead to readability issues and potential bugs due to variable shadowing. Consider renaming these variables or restructuring the code to improve clarity.

Also applies to: 111-111

controller/engine_controller.go (6)

435-467: Improve error handling in findInstanceManagerAndIPs

The function findInstanceManagerAndIPs can be enhanced to handle errors more gracefully. Specifically, when retrieving targetIM, if an error occurs, providing more context can help in debugging.

Consider wrapping the error with additional context:

 if e.Spec.TargetNodeID != "" {
     targetIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, true)
     if err != nil {
-        return nil, "", "", err
+        return nil, "", "", errors.Wrap(err, "failed to get target instance manager")
     }

Line range hint 2419-2476: Enhance error messages in Upgrade method

In the Upgrade method for DataEngineV2, error messages can be improved to provide more context, especially when an instance is not found or not running.

Enhance error messages for clarity:

 if _, ok := im.Status.InstanceEngines[e.Name]; !ok {
-    return fmt.Errorf("target instance %v is not found in engine list", e.Name)
+    return fmt.Errorf("target instance %v is not found in instance manager %v engine list", e.Name, im.Name)
 }

Line range hint 2419-2476: Avoid code duplication in instance existence checks

The code blocks checking for the existence of the initiator and target instances in the Upgrade method are nearly identical. Refactoring to a helper function can reduce duplication and improve maintainability.

Consider creating a helper function:

func checkInstanceExists(im *longhorn.InstanceManager, e *longhorn.Engine, role string) error {
    if _, ok := im.Status.InstanceEngines[e.Name]; !ok {
        return fmt.Errorf("%s instance %v is not found in instance manager %v engine list", role, e.Name, im.Name)
    }
    return nil
}

Then use it:

if err := checkInstanceExists(im, e, "initiator"); err != nil {
    return err
}
// ...
if err := checkInstanceExists(im, e, "target"); err != nil {
    return err
}

Line range hint 2545-2583: Clarify logic in isResponsibleFor method

The isResponsibleFor method contains complex logic that could benefit from additional comments explaining the decision-making process, especially around the isPreferredOwner, continueToBeOwner, and requiresNewOwner variables.

Add comments to improve readability:

// Determine if the current node is the preferred owner and has the data engine available
isPreferredOwner := currentNodeDataEngineAvailable && isResponsible

// Continue to be the owner if the preferred owner doesn't have the data engine available, but the current owner does
continueToBeOwner := currentNodeDataEngineAvailable && !preferredOwnerDataEngineAvailable && ec.controllerID == e.Status.OwnerID

// Require new ownership if neither the preferred owner nor the current owner have the data engine, but the current node does
requiresNewOwner := currentNodeDataEngineAvailable && !preferredOwnerDataEngineAvailable && !currentOwnerDataEngineAvailable

646-673: Ensure consistency in error messages

In the SuspendInstance method, error messages use different formats. For instance, some messages start with a lowercase letter, while others start with uppercase. Ensuring consistency improves readability and professionalism.

Standardize error messages (the Go convention is for error strings to start with a lowercase letter):

 return fmt.Errorf("invalid object for engine instance suspension: %v", obj)
 // ...
 return fmt.Errorf("suspending engine instance is not supported for data engine %v", e.Spec.DataEngine)

750-760: Handle potential errors when switching over target

In SwitchOverTarget, after obtaining the targetIM and initiatorIM, the code proceeds to use their IPs. It would be prudent to check if these IPs are valid (non-empty) before proceeding, to prevent issues with network communication.

Add checks for valid IPs:

if targetIM.Status.IP == "" {
    return fmt.Errorf("target instance manager IP is empty for engine %v", e.Name)
}
if initiatorIM.Status.IP == "" {
    return fmt.Errorf("initiator instance manager IP is empty for engine %v", e.Name)
}
controller/volume_controller.go (4)

1009-1010: Fix typo in comment for clarity

Correct the typo in the comment. Change "something must wrong" to "something must be wrong".

Apply this diff to fix the typo:

-        // r.Spec.Active shouldn't be set for the leftover replicas, something must wrong
+        // r.Spec.Active shouldn't be set for the leftover replicas; something must be wrong

1834-1834: Improve grammar in comment for better readability

Modify the comment for clarity. Change "the image of replica is no need to be the same" to "the replica's image does not need to be the same".

Apply this diff to improve the comment:

-                // For v2 volume, the image of replica is no need to be the same as the volume image
+                // For v2 volume, the replica's image does not need to be the same as the volume image

3208-3210: Remove empty else block to simplify code

The else block at line 3208 is empty and can be removed to improve code clarity.

Apply this diff to remove the empty else block:

-         } else {
-           // TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
-         }
+         // TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
🧰 Tools
🪛 golangci-lint (1.61.0)

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor

[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)


3221-3226: Simplify code by unwrapping unnecessary else block

Since the if block at line 3219 returns, the else block is unnecessary and can be unwrapped for better readability.

Apply a change along these lines (move the short variable declaration onto its own line so replicaAddressMap stays in scope, then drop the else and outdent its block):

replicaAddressMap, err := c.constructReplicaAddressMap(v, e, rs)
if err != nil {
  return nil
}
if !reflect.DeepEqual(e.Spec.UpgradedReplicaAddressMap, replicaAddressMap) {
  e.Spec.UpgradedReplicaAddressMap = replicaAddressMap
  return nil
}
🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)

datastore/longhorn.go (1)

5641-5875: Recommend adding unit tests for new DataEngineUpgradeManager methods

Consider adding unit tests to cover the new methods added for DataEngineUpgradeManager and NodeDataEngineUpgrade resources to ensure correct functionality and future maintainability.
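
As a starting point, a hedged test sketch against the generated fake clientset; the package placement, import aliases, namespace, and object names are assumptions, and it exercises the generated client rather than the datastore wrappers themselves:

package datastore_test

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	longhorn "github.com/longhorn/longhorn-manager/k8s/pkg/apis/longhorn/v1beta2"
	lhfake "github.com/longhorn/longhorn-manager/k8s/pkg/client/clientset/versioned/fake"
)

// Hedged sketch: round-trip a DataEngineUpgradeManager through the fake clientset.
func TestDataEngineUpgradeManagerCreateGet(t *testing.T) {
	client := lhfake.NewSimpleClientset()
	obj := &longhorn.DataEngineUpgradeManager{
		ObjectMeta: metav1.ObjectMeta{Name: "upgrade-manager-demo", Namespace: "longhorn-system"},
	}
	if _, err := client.LonghornV1beta2().DataEngineUpgradeManagers("longhorn-system").Create(
		context.TODO(), obj, metav1.CreateOptions{}); err != nil {
		t.Fatalf("failed to create DataEngineUpgradeManager: %v", err)
	}
	got, err := client.LonghornV1beta2().DataEngineUpgradeManagers("longhorn-system").Get(
		context.TODO(), "upgrade-manager-demo", metav1.GetOptions{})
	if err != nil {
		t.Fatalf("failed to get DataEngineUpgradeManager: %v", err)
	}
	if got.Name != "upgrade-manager-demo" {
		t.Errorf("unexpected name %q", got.Name)
	}
}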

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 612db95 and 393807e.

⛔ Files ignored due to path filters (6)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/longhorn-instance-manager/pkg/api/instance.go is excluded by !vendor/**
  • vendor/github.com/longhorn/types/pkg/generated/imrpc/instance.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (51)
  • controller/backup_controller.go (1 hunks)
  • controller/controller_manager.go (2 hunks)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
  • controller/monitor/upgrade_manager_monitor.go (1 hunks)
  • controller/node_controller.go (2 hunks)
  • controller/node_upgrade_controller.go (1 hunks)
  • controller/replica_controller.go (5 hunks)
  • controller/uninstall_controller.go (4 hunks)
  • controller/upgrade_manager_controller.go (1 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
  • datastore/datastore.go (3 hunks)
  • datastore/longhorn.go (6 hunks)
  • engineapi/instance_manager.go (5 hunks)
  • engineapi/instance_manager_test.go (1 hunks)
  • k8s/crds.yaml (84 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (6 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/node.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/register.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/volume.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (4 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go (3 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/generic.go (2 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go (4 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go (2 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • scheduler/replica_scheduler.go (1 hunks)
  • types/types.go (4 hunks)
  • webhook/resources/dataengineupgrademanager/mutator.go (1 hunks)
  • webhook/resources/dataengineupgrademanager/validator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/mutator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/validator.go (1 hunks)
  • webhook/resources/volume/mutator.go (2 hunks)
  • webhook/resources/volume/validator.go (5 hunks)
  • webhook/server/mutation.go (2 hunks)
  • webhook/server/validation.go (2 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
✅ Files skipped from review due to trivial changes (4)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go
🧰 Additional context used
📓 Learnings (1)
webhook/server/mutation.go (1)
Learnt from: mantissahz
PR: longhorn/longhorn-manager#2182
File: webhook/server/mutation.go:46-46
Timestamp: 2024-11-10T16:45:04.898Z
Learning: In the Longhorn codebase, other custom resource mutators do not implement the `Mutate` method.
🪛 golangci-lint (1.61.0)
controller/uninstall_controller.go

209-209: Error return value of ds.DataEngineUpgradeManagerInformer.AddEventHandler is not checked

(errcheck)


213-213: Error return value of ds.NodeDataEngineUpgradeInformer.AddEventHandler is not checked

(errcheck)

controller/volume_controller.go

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (103)
k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go (1)

35-36: LGTM! The new expansion interfaces align with Kubernetes patterns.

The addition of DataEngineUpgradeManagerExpansion and NodeDataEngineUpgradeExpansion interfaces follows the standard Kubernetes client-go pattern for extensibility. These interfaces support the v2 data engine live upgrade feature by providing extension points for the new resource types.

Let's verify that the corresponding resource types exist:

Also applies to: 45-46

✅ Verification successful

The expansion interfaces correctly correspond to their CRD types

The verification confirms that both expansion interfaces have their corresponding Custom Resource Definition (CRD) types properly defined in the API package:

  • DataEngineUpgradeManager is defined in k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go
  • NodeDataEngineUpgrade is defined in k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go

Both types follow the standard Kubernetes CRD structure with TypeMeta, ObjectMeta, Spec, and Status fields.
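
For readers unfamiliar with that shape, a minimal sketch of the outer structure; only the conventional skeleton is shown and it is not copied from the PR:

// Minimal sketch of the standard CRD type shape described above; the Spec and
// Status type names follow the usual convention and their contents are omitted.
// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
type DataEngineUpgradeManager struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// +optional
	Spec DataEngineUpgradeManagerSpec `json:"spec,omitempty"`
	// +optional
	Status DataEngineUpgradeManagerStatus `json:"status,omitempty"`
}
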

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of corresponding CRD types for the new expansion interfaces

# Check for DataEngineUpgradeManager type
rg -A 5 "type DataEngineUpgradeManager struct" k8s/pkg/apis/longhorn/v1beta2/

# Check for NodeDataEngineUpgrade type
rg -A 5 "type NodeDataEngineUpgrade struct" k8s/pkg/apis/longhorn/v1beta2/

Length of output: 1327

k8s/pkg/apis/longhorn/v1beta2/register.go (2)

68-71: LGTM! The new types follow Kubernetes API conventions.

The addition of DataEngineUpgradeManager and NodeDataEngineUpgrade types (and their List counterparts) is well-structured and maintains the alphabetical ordering in the scheme registration.


68-71: Verify API versioning implications.

Adding new types to the v1beta2 API group is a significant change. Please ensure:

  1. These changes don't require an API version bump
  2. Any necessary migration paths are in place
  3. API backward compatibility is maintained
✅ Verification successful

Let me gather more information about API versioning and compatibility by checking for conversion code and CRD definitions specifically for these new types.


Let me gather more information about the API versioning and conversion setup by checking for conversion webhooks and CRD specifications.


Based on the gathered information, I can now provide a final response:

Adding new types to v1beta2 API is safe with existing conversion infrastructure

The codebase has proper API versioning support in place:

  • A conversion webhook service is configured and running (longhorn-conversion-webhook)
  • The webhook is a prerequisite for the manager's readiness
  • New types are being added to the existing v1beta2 API group which is the current storage version

No API version bump is needed as:

  1. These are new types being introduced, not modifications to existing ones
  2. The conversion webhook infrastructure is already in place to handle future versioning needs
  3. The changes maintain backward compatibility since they're additive
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for API version related files and CRD definitions

# Check for API migration code
echo "Checking for API migration code..."
rg -g '!vendor' -l 'conversion.*v1beta' 

# Check for CRD definitions
echo "Checking for CRD definitions..."
fd -e yaml -e yml . | xargs rg -l 'DataEngineUpgradeManager\|NodeDataEngineUpgrade'

# Check for version-related comments/docs
echo "Checking for version-related documentation..."
rg -g '!vendor' 'v1beta2.*upgrade' 

Length of output: 1648


Script:

#!/bin/bash
# Check for conversion code and API compatibility

# Look for conversion functions or hub/conversion implementations
echo "Checking for conversion implementations..."
rg -g '!vendor' 'func.*Convert.*DataEngineUpgrade'

# Check for CRD definitions of the new types
echo "Checking CRD yaml files..."
fd crd | xargs cat | grep -A 10 "kind: CustomResourceDefinition" | grep -A 10 "longhorn.io"

# Check for any API compatibility comments or documentation
echo "Checking for API compatibility documentation..."
rg -g '!vendor' -B 2 -A 2 'v1beta2.*DataEngineUpgrade'

# Check for hub versions or storage versions
echo "Checking for storage version markers..."
rg -g '!vendor' 'storage.*=.*true' k8s/

Length of output: 61375


Script:

#!/bin/bash
# Check for conversion webhook configurations and CRD specifications

# Look for conversion webhook configurations
echo "Checking for conversion webhook configurations..."
rg -g '!vendor' -B 3 -A 3 'conversion.*webhook'

# Check CRD specifications for storage versions
echo "Checking CRD specifications for storage versions..."
fd -e yaml -e yml . | xargs grep -A 5 "versions:"

# Look for API compatibility comments or documentation
echo "Checking for API compatibility documentation..."
rg -g '!vendor' -B 2 -A 2 'v1beta.*compatibility'

Length of output: 14476

webhook/server/validation.go (2)

13-13: LGTM: Import additions are well-organized

The new imports for data engine upgrade components follow the project's package structure and are properly organized with other resource imports.

Also applies to: 17-17


52-53: LGTM: Validator additions follow established patterns

The new validators for data engine upgrades are properly integrated into the validation chain and follow the same initialization pattern as other resource validators.

Let's verify the implementation of these validators:

✅ Verification successful

Let me try a different approach to verify the validator implementations.


Validators are properly implemented with all required interface methods

Both validators follow the correct implementation pattern (a skeleton sketch follows this list) with:

  • Proper struct definitions embedding DefaultValidator
  • Complete implementation of admission.Validator interface:
    • Resource() method defining the resource names
    • Create() method for validation of new resources
    • Update() method for validation of resource modifications
    • Proper type assertions for their respective types (DataEngineUpgradeManager and NodeDataEngineUpgrade)
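
Put together, that pattern reduces to a skeleton like the hedged sketch below (imports omitted); the Update signature matches the one quoted earlier in this review, while the struct layout and method bodies are illustrative placeholders:

// Hedged skeleton of the validator pattern described above; only the shape is
// meaningful, the method bodies are placeholders.
type dataEngineUpgradeManagerValidator struct {
	admission.DefaultValidator
	ds *datastore.DataStore
}

func (u *dataEngineUpgradeManagerValidator) Create(request *admission.Request, newObj runtime.Object) error {
	upgradeManager, ok := newObj.(*longhorn.DataEngineUpgradeManager)
	if !ok {
		return werror.NewInvalidError(fmt.Sprintf("%T is not a *longhorn.DataEngineUpgradeManager", newObj), "")
	}
	_ = upgradeManager // field-level checks on the new object would go here
	return nil
}

func (u *dataEngineUpgradeManagerValidator) Update(request *admission.Request, oldObj runtime.Object, newObj runtime.Object) error {
	// immutability checks (for example on spec.nodes) would go here
	return nil
}
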
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of the new validators
# Expected: Each validator should implement the required admission.Validator interface methods

# Check for the validator implementations
echo "Checking DataEngineUpgradeManager validator..."
ast-grep --pattern 'type $_ struct {
  $$$
}

func (v $_) Resource() admission.Resource {
  $$$
}

func (v $_) Create($_) error {
  $$$
}

func (v $_) Update($_) error {
  $$$
}

func (v $_) Delete($_) error {
  $$$
}' webhook/resources/dataengineupgrademanager/validator.go

echo "Checking NodeDataEngineUpgrade validator..."
ast-grep --pattern 'type $_ struct {
  $$$
}

func (v $_) Resource() admission.Resource {
  $$$
}

func (v $_) Create($_) error {
  $$$
}

func (v $_) Update($_) error {
  $$$
}

func (v $_) Delete($_) error {
  $$$
}' webhook/resources/nodedataengineupgrade/validator.go

Length of output: 750


Script:

#!/bin/bash
# Let's check if these files exist first and then look at their content
echo "Checking for validator files..."
fd validator.go -p webhook/resources/dataengineupgrademanager
fd validator.go -p webhook/resources/nodedataengineupgrade

echo -e "\nChecking DataEngineUpgradeManager validator content..."
rg -A 2 "type.*struct|func.*Resource|func.*Create|func.*Update|func.*Delete" webhook/resources/dataengineupgrademanager/validator.go || true

echo -e "\nChecking NodeDataEngineUpgrade validator content..."
rg -A 2 "type.*struct|func.*Resource|func.*Create|func.*Update|func.*Delete" webhook/resources/nodedataengineupgrade/validator.go || true

Length of output: 2104

webhook/resources/dataengineupgrademanager/mutator.go (4)

21-24: LGTM! Clean struct definition following composition pattern.

The struct is well-designed, embedding the DefaultMutator and maintaining necessary dependencies.


1-74: Overall implementation looks solid!

The mutator implementation follows good practices with proper error handling, clear separation of concerns, and follows the webhook pattern. The suggested improvements are minor enhancements to improve debugging and robustness.


47-74: Consider validating upgradeManager.Name.

The mutate function should validate that upgradeManager.Name is non-empty before using it in error messages.
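
A minimal sketch of such a guard, placed right after the type assertion (werror.NewInvalidError is assumed to be the webhook error helper already used in this package):

// Reject an unnamed resource before the name is interpolated into error messages.
if upgradeManager.Name == "" {
    return nil, werror.NewInvalidError("upgradeManager.Name is empty", "metadata.name")
}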


37-39: Verify if Update operation should be supported.

Given that this is for live upgrades, should the mutator also handle Update operations for DataEngineUpgradeManager resources?

✅ Verification successful

Based on the gathered information from all the executed scripts, I can now provide a final response:

Update operation should be omitted for this mutating webhook

The current implementation correctly handles only Create operations for the DataEngineUpgradeManager webhook. While there are Update operations in the client code (as seen in the clientset), these are for normal API operations. The mutation webhook specifically focuses on initial creation because:

  1. The client code shows separate paths for Updates and UpdateStatus operations, which are handled by the controller layer rather than admission webhooks
  2. The DataEngineUpgradeManagerController exists to manage the lifecycle after creation
  3. No other webhooks were found handling updates, suggesting this is an intentional design where mutations are only needed during resource creation
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if Update operations are handled elsewhere or if this is intentional.

# Check for Update operation handlers in related files
rg -A 5 "DataEngineUpgradeManager.*Update" 

# Check validator for any Update operation handling
fd -e go -x grep -l "Update.*DataEngineUpgradeManager" {} \;

Length of output: 7308


Script:

#!/bin/bash
# Let's check the validator configuration to understand the complete admission control setup
rg -A 10 "DataEngineUpgradeManager.*validator" 

# Also check the CRD definition to understand the intended operations
fd "dataengineupgrademanager" -t f -e yaml | xargs cat

# Check for any upgrade-related logic in controllers
ast-grep --pattern 'type DataEngineUpgradeManagerController struct {
  $$$
}'

Length of output: 1311


Script:

#!/bin/bash
# Let's check the controller implementation to understand the upgrade workflow
rg -A 10 "func \(c \*DataEngineUpgradeManagerController\) (Create|Update|Sync|reconcile)" 

# Check if there are any other webhooks handling updates
fd -t f -e go -x grep -l "webhook.*DataEngineUpgradeManager" {} \;

# Check for any validation rules in the webhook
rg -A 5 "ValidatingWebhookConfiguration.*DataEngineUpgradeManager"

Length of output: 232

webhook/resources/nodedataengineupgrade/mutator.go (3)

1-19: LGTM: Well-organized imports and package structure

The imports are properly organized and grouped by external/internal dependencies.


54-62: Verify label mutation security implications

The code applies labels from GetNodeDataEngineUpgradeLabels without validation. Verify that these labels cannot be manipulated in ways that could affect system security.

✅ Verification successful

Let me analyze the implementation of GetNodeDataEngineUpgradeLabels to understand how the labels are constructed.


Let me check the label key constants and how they are used in the system.


Based on the analysis of the code, I can now generate the final response:

Label mutation is safely implemented with controlled values

The GetNodeDataEngineUpgradeLabels implementation shows that:

  1. Labels are constructed using a fixed prefix longhorn.io/
  2. Label keys are predefined constants
  3. Label values are derived from controlled inputs (upgradeManagerID and nodeID)
  4. Base labels include a managed-by field set to a constant control plane name

The label mutation is safe because:

  • All label keys are namespaced with the Longhorn prefix
  • Values are derived from existing resource fields without transformation
  • The implementation follows Kubernetes labeling best practices
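
For illustration only, the label map described above has roughly this shape (the key names are stand-ins, not the project's actual constants):

// Illustrative only: real keys come from the predefined label constants.
labels := map[string]string{
    "longhorn.io/managed-by":      "longhorn-manager",
    "longhorn.io/upgrade-manager": upgradeManagerID,
    "longhorn.io/upgrade-node":    nodeID,
}
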
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check how NodeDataEngineUpgrade labels are used in the system
rg -A 5 "GetNodeDataEngineUpgradeLabels"

# Look for label-based security controls
rg -l "selector.*label" 

Length of output: 17205


Script:

#!/bin/bash
# Check the implementation of GetBaseLabelsForSystemManagedComponent and related label functions
ast-grep --pattern 'func GetBaseLabelsForSystemManagedComponent() map[string]string {
  $$$
}'

# Check label key constants
rg -A 3 "const.*LonghornLabel"

Length of output: 343


Script:

#!/bin/bash
# Check all Longhorn label related constants and their usage
rg "LonghornLabel" types/types.go -A 2

# Check how these labels are validated or used in security contexts
ast-grep --pattern 'func $_($$_) {
  $$$
  label$$$
  $$$
}'

Length of output: 8496


30-41: Verify if Update/Delete operations should be supported

The mutator currently only supports Create operations. Given this is for live upgrades, verify if Update/Delete operations should also be handled.

✅ Verification successful

Let me check the controller implementation to understand how NodeDataEngineUpgrade resources are handled.


Only Create operations are required for NodeDataEngineUpgrade webhook

The current implementation is correct. The evidence shows that:

  • The controller handles status updates internally via UpdateNodeDataEngineUpgradeStatus
  • Status updates are implementation details and don't require webhook validation
  • The resource is designed to be immutable after creation, with only its status being updated by the controller
  • Delete operations are handled by standard Kubernetes garbage collection
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any Update/Delete operations on NodeDataEngineUpgrade resources
ast-grep --pattern 'Update($$$)' 
ast-grep --pattern 'Delete($$$)'

# Check if there are any controller implementations that might need these operations
rg -l "NodeDataEngineUpgrade.*Controller"

Length of output: 184


Script:

#!/bin/bash
# Check the controller implementation for update/delete operations
rg -A 10 "NodeDataEngineUpgrade.*Controller" controller/node_upgrade_controller.go

# Check for any status updates or deletions in the controller
rg "Update|Delete" controller/node_upgrade_controller.go

# Check the API type definition for any fields that might need updates
ast-grep --pattern 'type NodeDataEngineUpgrade struct {
  $$$
}'

Length of output: 5747

k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (5)

1-4: LGTM: Clean package declaration and imports

The package declaration and imports are appropriate for a Kubernetes custom resource definition.


37-53: LGTM: Well-structured CRD with comprehensive annotations

The resource definition includes:

  • Appropriate generation tags for client and deepcopy
  • Useful print columns for operational visibility
  • Proper storage version annotation
  • Standard Kubernetes resource structure

55-62: LGTM: Standard list type implementation

The list type follows the standard Kubernetes pattern for list resources.


13-19: Document possible upgrade states and messages

The structure would benefit from documentation describing:

  • Possible values for the State field
  • Expected format/content of the Message field

Let's verify the UpgradeState type definition:
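
A quick search along these lines should surface the type and its constants (the package path is assumed from the file under review):

#!/bin/bash
# Locate the UpgradeState type definition and its constant values.
rg -n "UpgradeState" k8s/pkg/apis/longhorn/v1beta2/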


1-62: Verify integration with upgrade controller

Since this CRD is central to the v2 data engine live upgrade feature, let's verify its integration with the upgrade controller:

✅ Verification successful

The search results show comprehensive integration of the DataEngineUpgradeManager CRD with the upgrade controller. Let me verify the controller implementation to ensure it properly handles the resource.


Let me verify the monitor implementation since it's a crucial part of the upgrade process.


Based on the comprehensive verification of the codebase, I can now provide a final response:

Integration with upgrade controller is properly implemented

The DataEngineUpgradeManager CRD is well-integrated with a dedicated controller and monitoring system:

  • Complete controller implementation with proper reconciliation logic
  • Dedicated monitor implementation that tracks upgrade status
  • Proper handling of ownership, state transitions, and cleanup
  • Status updates are properly propagated through the controller
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for DataEngineUpgradeManager usage in controllers
rg "DataEngineUpgradeManager" --type go -g "*controller*.go"

Length of output: 5805


Script:

#!/bin/bash
# Check the core reconciliation logic in the upgrade manager controller
ast-grep --pattern 'func (uc *DataEngineUpgradeManagerController) reconcile($$$) $$$'

Length of output: 6851


Script:

#!/bin/bash
# Check the DataEngineUpgradeManagerMonitor implementation
ast-grep --pattern 'type DataEngineUpgradeManagerMonitor struct {
  $$$
}'

# Also check its constructor
ast-grep --pattern 'func NewDataEngineUpgradeManagerMonitor($$$) $$$'

Length of output: 2412

webhook/server/mutation.go (1)

16-16: LGTM! Import statements follow project conventions.

The new imports for data engine upgrade resources are properly organized alongside other resource imports.

Also applies to: 21-21

engineapi/instance_manager_test.go (2)

1-4: LGTM!

The package declaration and imports are clean and follow Go best practices.


86-96: ⚠️ Potential issue

Fix map comparison logic in equalMaps.

The current implementation compares values with b[k], which cannot distinguish a key that is missing from 'b' from a key whose value is the empty string. Even with the length check, this can produce false positives when 'a' contains an empty-string value and 'b' carries a different, extra key.

Apply this diff to fix the map comparison:

 func equalMaps(a, b map[string]string) bool {
     if len(a) != len(b) {
         return false
     }
     for k, v := range a {
-        if b[k] != v {
+        if bv, exists := b[k]; !exists || bv != v {
             return false
         }
     }
     return true
 }

The length check at the start ensures both maps have the same number of entries, but the explicit existence check also closes the empty-string edge case described above and makes the comparison easier to read.
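
A concrete instance of that edge case:

// len(a) == len(b) and b["image"] yields the zero value "",
// so the original version returns true; the fixed version returns false.
a := map[string]string{"image": ""}
b := map[string]string{"node": "worker-1"}
_ = equalMaps(a, b)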

Likely invalid or redundant comment.

k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go (2)

71-78: LGTM!

The list type implementation follows Kubernetes conventions perfectly.


28-28: Verify DataEngineType dependency.

The DataEngineType type is used but not imported. Need to verify its definition and compatibility.

k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go (3)

1-26: LGTM: File header and imports are properly structured

The file follows standard Kubernetes code generation patterns with proper license header and necessary imports.


55-94: LGTM: Namespace-specific lister implementation is correct

The implementation correctly handles namespaced resources and follows Kubernetes patterns for error handling. Let's verify the error handling consistency across the codebase.

✅ Verification successful

Error handling and key construction patterns are consistent across the codebase

The verification shows that the error handling in NodeDataEngineUpgrade lister follows the standard Kubernetes patterns consistently used across all listers in the codebase:

  • All listers use errors.NewNotFound() with proper resource types for not-found errors
  • Key construction follows the standard namespace + "/" + name format across all namespaced resources
  • The implementation aligns with both v1beta1 and v1beta2 listers in the codebase
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent error handling patterns
# Expected: Find similar error handling patterns in other listers

# Search for similar error handling patterns in other listers
ast-grep --pattern 'errors.NewNotFound($_, $_)'

# Search for key construction patterns
rg 'GetByKey\(.*\+.*\)' --type go

Length of output: 12059


28-53: LGTM: NodeDataEngineUpgradeLister implementation follows Kubernetes patterns

The implementation correctly uses cache.Indexer and follows standard Kubernetes lister patterns. However, let's verify the resource type registration.

k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (3)

1-26: LGTM: Standard package setup and imports

The file header, license, and imports are properly structured following Kubernetes conventions.


28-35: LGTM: Well-structured interface with extension support

The interface follows Kubernetes best practices with:

  • Clear method documentation
  • Standard List method with label selector support
  • Namespace-specific lister access
  • Extension point via DataEngineUpgradeManagerListerExpansion

84-94: LGTM: Proper error handling in Get method

The Get method follows best practices with:

  • Proper key construction for indexer lookup
  • Comprehensive error handling
  • Use of standard Kubernetes NotFound error type
k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go (3)

17-17: Note: This is generated code

This file is automatically generated by informer-gen. Any changes should be made to the source generation configuration rather than directly to this file.


80-82: LGTM: Proper namespace indexing

The default informer correctly sets up namespace indexing, which is crucial for the live upgrade feature to properly scope upgrade operations to specific namespaces.


61-66: Verify context handling in List/Watch operations

The implementation uses context.TODO() for List/Watch operations. While this is common in generated code, verify that the parent controller properly handles context cancellation for cleanup during upgrades.

Also applies to: 67-72

✅ Verification successful

Context handling is properly implemented in the monitoring system

The verification shows that context handling is properly implemented:

  1. The base monitor struct maintains a context.Context and CancelFunc
  2. The NodeDataEngineUpgradeMonitor uses PollUntilContextCancel which properly handles context cancellation
  3. The monitor is started with a context from the parent controller system

The use of context.TODO() in the informer is standard practice for generated code, as the actual context handling is implemented in the controller/monitor layer where cleanup operations occur.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for context handling in the parent controller
ast-grep --pattern 'func \($_\) \($_\)(\$_, context.Context, \$_) {
  $$$
}'

Length of output: 78


Script:

#!/bin/bash
# Let's search for NodeDataEngineUpgrade controller implementations and context usage
rg -l "NodeDataEngineUpgrade" --type go

# Then search for context handling patterns in informer/controller code
rg "context\." -A 3 -B 3 --type go | grep -A 3 -B 3 "NodeDataEngineUpgrade"

# Also check for general controller patterns with context
ast-grep --pattern 'type $_Controller struct {
  $$$
}'

# And search for Run/Start methods that might handle context
rg "func.*Run|func.*Start" --type go | grep -A 3 "context"

Length of output: 70035


Script:

#!/bin/bash
# Let's search for NodeDataEngineUpgrade controller and monitor implementations
rg -A 10 "type.*NodeDataEngineUpgrade.*Controller struct" --type go

# Check for context handling in node upgrade monitor
rg -A 10 "type.*NodeDataEngineUpgradeMonitor struct" --type go

# Look for context usage in upgrade-related code
rg "context\." -A 3 -B 3 "node.*upgrade" --type go --ignore-case

Length of output: 1689


Script:

#!/bin/bash
# Let's check how context is handled in the node upgrade controller and monitor
rg -A 5 "func.*Run|func.*Start" controller/node_upgrade_controller.go controller/monitor/node_upgrade_monitor.go

# Check for context usage in the controller implementation
rg "context\." -A 3 controller/node_upgrade_controller.go

# Look at the base monitor implementation for context handling
rg -A 10 "type.*baseMonitor struct" --type go

Length of output: 3072

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go (2)

79-81: LGTM! Implementation follows the established pattern.

The NodeDataEngineUpgrades method is correctly implemented following the client-gen pattern and maintains consistency with other methods in the file.

Let's verify the existence of the corresponding fake implementation:

✅ Verification successful

Implementation is correctly structured and complete

The fake implementation exists at k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go and follows the established pattern in the codebase. The implementation is properly referenced in both the client file and has its dedicated fake implementation file, consistent with other resources in the same package.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the fake implementation file exists
fd -e go "fake_nodedataengineupgrades\.go$" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/

Length of output: 109


Script:

#!/bin/bash
# Let's try a broader search to understand the structure and naming patterns
fd -e go . k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/

# Also search for references to NodeDataEngineUpgrade in the codebase
rg -l "NodeDataEngineUpgrade" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/

Length of output: 2529


59-61: LGTM! Implementation follows the established pattern.

The DataEngineUpgradeManagers method is correctly implemented following the client-gen pattern and maintains consistency with other methods in the file.

Let's verify the existence of the corresponding fake implementation:

✅ Verification successful

The fake implementation exists and follows the correct pattern

The verification confirms that:

  • The fake implementation file fake_dataengineupgrademanager.go exists in the correct location
  • The file follows the same naming convention as other fake implementations
  • Both the client and implementation files contain the necessary DataEngineUpgradeManager references
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the fake implementation file exists
fd -e go "fake_dataengineupgrademanagers\.go$" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/

Length of output: 112


Script:

#!/bin/bash
# Let's try a broader search to understand the fake implementations structure
fd -e go "fake_.*\.go$" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/

Length of output: 2198


Script:

#!/bin/bash
# Let's also check the content of the directory to ensure we're looking at the right place
ls -la k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/

Length of output: 2000


Script:

#!/bin/bash
# Let's search for DataEngineUpgradeManager references in the fake directory
rg -l "DataEngineUpgradeManager" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/

Length of output: 279

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go (4)

36-36: LGTM: Interface additions follow Kubernetes client-go patterns

The new getter interfaces DataEngineUpgradeManagersGetter and NodeDataEngineUpgradesGetter are correctly added to the LonghornV1beta2Interface, following standard Kubernetes client-go patterns and maintaining alphabetical order.

Also applies to: 41-41


88-90: LGTM: DataEngineUpgradeManagers implementation is correct

The implementation follows the standard Kubernetes client-go patterns, properly handling namespace scoping and using the factory pattern consistently with other resource types.


108-110: LGTM: NodeDataEngineUpgrades implementation is correct

The implementation follows the standard Kubernetes client-go patterns, properly handling namespace scoping and using the factory pattern consistently with other resource types.


Line range hint 36-110: Verify the presence of required supporting files

Since this is a generated file, let's verify the presence of all required supporting files for the new resource types.

✅ Verification successful

Let me verify the factory functions more thoroughly since the previous search pattern might have been too strict.


All required supporting files are present and properly integrated

The verification confirms the presence of all necessary components:

  • Resource type definitions exist in k8s/pkg/apis/longhorn/v1beta2/
  • Interface definitions are present in the client package
  • Factory functions are correctly implemented in their respective files
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for the presence of required supporting files for new resource types

echo "Checking for resource type definitions..."
fd -e go "dataengineupgrademanager" k8s/pkg/apis/
fd -e go "nodedataengineupgrade" k8s/pkg/apis/

echo "Checking for factory functions..."
fd -e go "new.*dataengineupgrademanager" k8s/pkg/client/
fd -e go "new.*nodedataengineupgrade" k8s/pkg/client/

echo "Checking for interface definitions..."
rg -l "type (DataEngineUpgradeManager|NodeDataEngineUpgrade)Interface" k8s/pkg/client/

Length of output: 837


Script:

#!/bin/bash
# Check for factory function implementations with a more flexible pattern
echo "Checking factory functions in dataengineupgrademanager.go..."
rg "func new.*DataEngineUpgradeManager" k8s/pkg/client/

echo "Checking factory functions in nodedataengineupgrade.go..."
rg "func new.*NodeDataEngineUpgrade" k8s/pkg/client/

Length of output: 745

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go (2)

39-41: LGTM: Resource and kind definitions are correct.

The GroupVersionResource and GroupVersionKind definitions follow Kubernetes API conventions and align with the v2 data engine upgrade feature.


33-37: LGTM: Well-structured fake client implementation.

The FakeDataEngineUpgradeManagers implementation follows client-go patterns and provides a comprehensive testing infrastructure for the v2 data engine upgrade feature. The implementation includes all necessary CRUD operations with proper error handling and type safety.

Let's verify the consistency of the resource definition across the codebase:

✅ Verification successful

Let me gather more information to verify the complete implementation:


LGTM: Fake client implementation is complete and consistent

The fake client implementation correctly implements all methods defined in the DataEngineUpgradeManagerInterface, including CRUD operations (Create, Read, Update, Delete), List, Watch, Patch, and UpdateStatus. The implementation follows the standard client-go patterns and properly handles type safety and error cases.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify resource definition consistency

# Check for the actual CRD definition
rg -A 5 "kind: CustomResourceDefinition.*dataengineupgrademanagers" 

# Check for related controller implementations
ast-grep --pattern 'type DataEngineUpgradeManager struct {
  $$$
}'

Length of output: 819


Script:

#!/bin/bash
# Check for the fake client implementation completeness
rg -A 10 "func \(c \*FakeDataEngineUpgradeManagers\)" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go

# Check for the real client implementation
rg -A 10 "func \(c \*dataengineupgrademanagers\)" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go

# Check for the interface definition
ast-grep --pattern 'type DataEngineUpgradeManagerInterface interface {
  $$$
}'

Length of output: 6946

k8s/pkg/apis/longhorn/v1beta2/node.go (2)

16-31: LGTM! Verify constant usage consistency.

The new node condition reason constant follows the established naming convention and is appropriately placed within the node condition reasons group.

Let's verify the consistent usage of this constant across the codebase:

✅ Verification successful

✓ Constant usage is consistent and properly integrated

The constant NodeConditionReasonNodeDataEngineUpgradeRequested is correctly used in the node controller to set the reason when a node's data engine upgrade is requested. The implementation follows the same pattern as other node condition reasons, where it:

  • Sets disableScheduling to true
  • Sets the reason using the constant
  • Provides a descriptive message
  • Updates node conditions accordingly
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of the new constant to ensure it's properly integrated
rg "NodeConditionReasonNodeDataEngineUpgradeRequested" --type go

# Search for related condition checks to ensure consistent condition handling
ast-grep --pattern 'if $cond.Reason == $reason {
  $$$
}'

Length of output: 356


Script:

#!/bin/bash
# Let's check the context of how this constant is used in node_controller.go
rg -B 5 -A 5 "NodeConditionReasonNodeDataEngineUpgradeRequested" controller/node_controller.go

# Let's also check for similar condition reason handling patterns
rg "Reason = string\(.*NodeConditionReason.*\)" --type go

Length of output: 598


149-151: Consider versioning and migration strategy.

As this adds a new field to the Node CRD:

  1. Ensure backward compatibility with older versions
  2. Consider documenting the upgrade path for existing deployments
  3. Verify that the node controller handles the absence of this field gracefully

Let's check the CRD versioning and conversion setup:

✅ Verification successful

Let me gather more specific information about the conversion setup and version handling.


Let me check for any existing conversion handling for this new field.


Let me check for the conversion webhook implementation to verify how new fields are handled.


Based on the evidence gathered, I can now provide a final response:

Versioning and migration strategy is properly handled.

The CRD has proper webhook conversion setup between v1beta1 and v1beta2 versions as shown in the node.yaml configuration. The new field being optional (+optional) means:

  1. It will default to false for older versions
  2. The conversion webhook will handle the field appropriately between versions
  3. Existing deployments will continue to work without modification
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for CRD conversion webhook configurations
rg "conversion:" --type yaml

# Look for version handling in controllers
ast-grep --pattern 'type $name struct {
  $$$
  scheme *runtime.Scheme
  $$$
}'

Length of output: 510


Script:

#!/bin/bash
# Check the detailed conversion webhook configuration for node CRD
cat k8s/patches/crd/webhook/node.yaml

# Look for version conversion code specific to nodes
rg "Convert.*Node" -A 5

# Check for existing handling of optional fields in node controller
rg "DataEngineUpgradeRequested" -A 3

Length of output: 108887


Script:

#!/bin/bash
# Check for conversion code specific to Node CRD
cat k8s/patches/crd/webhook/node.yaml

# Look for any existing conversion handling in the codebase
rg "Convert.*Node.*v1beta" -A 5

Length of output: 5879


Script:

#!/bin/bash
# Look for the conversion webhook implementation
rg -g "*.go" "type.*ConversionHandler" -A 10

# Check for any existing conversion logic for Node CRD
rg -g "*.go" "Convert.*Node.*v1beta[12]" --no-heading

Length of output: 1984

k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (3)

239-239: LGTM: Helpful addition to kubectl output

The new printcolumn for Image will improve observability when using kubectl to inspect instance managers.


61-67: LGTM: New instance state for upgrade support

The addition of InstanceStateSuspended state aligns well with the live upgrade feature requirements. The state will be useful for temporarily suspending instances during the upgrade process.

Let's verify the state usage in the codebase:

✅ Verification successful

The suspended state is properly integrated and used in the codebase

The verification shows that InstanceStateSuspended is well-integrated into the codebase:

  • Used in volume controller for handling live upgrades
  • Properly handled in instance handler with specific logic for v2 data engine
  • Integrated with engine controller's instance management logic
  • Has proper state transition handling (suspend/resume operations)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for handlers of the suspended state
rg "InstanceStateSuspended" -A 5

Length of output: 4032


Line range hint 1-257: Verify CRD consistency across the upgrade feature

Let's ensure the instance manager changes are consistent with related CRDs:

✅ Verification successful

The shell script output provides sufficient information for a final response: the target-related fields (targetNodeID, TargetIP, StorageTargetIP) are properly defined in both the CRD (k8s/crds.yaml) and the Go types (k8s/pkg/apis/longhorn/v1beta2/instancemanager.go), showing consistency across the upgrade feature.

CRD definitions are properly aligned with type definitions

The instance manager changes maintain consistency with related CRDs:

  • All target-related fields are properly defined in CRDs with matching types
  • Field definitions in Go types match their CRD counterparts
  • The upgrade feature maintains backward compatibility with existing fields
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for related CRDs that might need similar fields
rg -t yaml "kind:\s+(Custom)?ResourceDefinition" | while read -r crd; do
  echo "=== Checking CRD: $crd ==="
  rg -A 10 "targetNodeID|TargetIP|StorageTargetIP" "$(dirname "$crd")"
done

Length of output: 111429

k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go (1)

77-83: LGTM! The new interfaces follow the established patterns.

The added interfaces for DataEngineUpgradeManager and NodeDataEngineUpgrade are consistent with the existing interface patterns and naming conventions in the file.

Let's verify that this file was properly generated:

Also applies to: 117-123

✅ Verification successful

✓ Generated code verified

The file contains the standard Kubernetes code generation marker "Code generated by lister-gen. DO NOT EDIT." confirming it's properly generated and not meant for manual edits.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the file matches the expected generated output
# Look for the code-generator configuration and verify its output

# Find code-generator configuration
fd -t f "generate-groups.sh|hack/generate.sh"

# Look for lister-gen configuration
rg -l "lister-gen" .

Length of output: 14625


Script:

#!/bin/bash
# Let's verify if this file is generated by looking for code generation configuration
# and checking the file header

# Look for code generation configuration in hack directory
fd -t f "generate-groups.sh|generate.sh" hack/

# Check the file header to confirm it's generated
head -n 5 k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go

Length of output: 290


Script:

#!/bin/bash
# Let's check for the code generation marker and search for the generation script in the repository

# Check for the code generation marker
rg -A 1 "Code generated by" k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go

# Search for any Kubernetes code generation related files
fd -t f "update-codegen.sh|codegen.sh"

Length of output: 180

k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go (2)

41-42: LGTM! Interface additions follow the established pattern.

The new informer methods are correctly added to the Interface type, maintaining alphabetical order and following the consistent pattern of other informer declarations.

Also applies to: 51-52


123-126: LGTM! Implementation follows the standard informer pattern.

The implementations of DataEngineUpgradeManagers() and NodeDataEngineUpgrades() correctly follow the established pattern:

  • Proper initialization using factory, namespace, and tweakListOptions
  • Consistent return type initialization
  • Matches the pattern of other informer implementations

Let's verify the existence of the required informer types:

Also applies to: 148-151

✅ Verification successful

Let me verify the implementation details of these informers to ensure complete consistency.


Verified: Informer implementations are correctly structured and consistent

The verification confirms that both informers follow the standard Kubernetes informer pattern with:

  • Properly defined interfaces (DataEngineUpgradeManagerInformer and NodeDataEngineUpgradeInformer)
  • Consistent struct implementations with required fields (factory, tweakListOptions, namespace)
  • Standard informer methods (Informer and Lister)
  • Matching implementation structure between both informer types
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of required informer type definitions

# Check for DataEngineUpgradeManagerInformer type
rg -l "type DataEngineUpgradeManagerInformer interface" .

# Check for NodeDataEngineUpgradeInformer type
rg -l "type NodeDataEngineUpgradeInformer interface" .

# Check for corresponding informer struct definitions
rg -l "type (dataEngineUpgradeManagerInformer|nodeDataEngineUpgradeInformer) struct" .

Length of output: 555


Script:

#!/bin/bash
# Check the implementation details of both informer types
rg -A 10 "type DataEngineUpgradeManagerInformer interface" ./k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go
rg -A 10 "type NodeDataEngineUpgradeInformer interface" ./k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go

# Check the struct implementations
rg -A 5 "type dataEngineUpgradeManagerInformer struct" ./k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go
rg -A 5 "type nodeDataEngineUpgradeInformer struct" ./k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go

Length of output: 1615

k8s/pkg/client/informers/externalversions/generic.go (1)

103-104: LGTM! The new informer cases are properly integrated.

The additions for dataengineupgrademanagers and nodedataengineupgrades follow the established patterns and are correctly placed within the v1beta2 version group.

Let's verify that these new resources are properly registered in the scheme:

Also applies to: 113-114

✅ Verification successful

The informer cases are properly integrated and supported by CRD definitions

The verification confirms:

  • Both dataengineupgrademanagers and nodedataengineupgrades are properly defined as CustomResourceDefinitions in k8s/crds.yaml
  • The CRDs follow Kubernetes conventions with correct group (longhorn.io), plural names, and scope
  • The informer factory setup in generic.go correctly maps these resources to their respective informers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the new resources are registered in the scheme

# Check for resource registration in scheme
rg -A 5 "AddToScheme" | grep -E "dataengineupgrademanagers|nodedataengineupgrades"

# Check for corresponding CRD definitions
fd -e yaml -e yml | xargs grep -l "kind: CustomResourceDefinition" | xargs grep -l -E "dataengineupgrademanagers|nodedataengineupgrades"

Length of output: 235


Script:

#!/bin/bash
# Let's examine the CRD definitions and scheme registration more thoroughly

# Check the CRD definitions
cat k8s/crds.yaml | grep -A 10 -E "dataengineupgrademanagers|nodedataengineupgrades"

# Look for scheme registration in generated code
rg "func (v1beta2|Register)" k8s/pkg/apis/longhorn/v1beta2/ -A 5

# Check informer factory registration
rg "NewSharedInformerFactory" k8s/pkg/client/informers/externalversions/ -A 5

Length of output: 3565

controller/controller_manager.go (1)

157-165: LGTM! Controller initialization follows established patterns.

The initialization of both new controllers follows the consistent pattern used throughout the codebase, with proper error handling.

Let's verify the controller definitions:

✅ Verification successful

Let me verify the controller execution segment mentioned in the review comment by checking how these controllers are started.


Let me fix the regex and try again to verify the controller execution.


Let me check for the other controller's execution as well.


Both controllers are properly defined and integrated into the controller manager

The verification confirms:

  • Both controllers follow the standard structure with proper inheritance from baseController
  • Constructor functions accept consistent parameters (logger, datastore, scheme, kubeClient, controllerID, namespace)
  • Controllers are correctly started with the standard pattern using goroutines and the same number of workers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the new controller definitions exist and follow the standard pattern

# Check for the controller definitions
rg -A 5 'type (DataEngineUpgradeManagerController|NodeDataEngineUpgradeController) struct'

# Check for the New constructor functions
rg -A 5 'func New(DataEngineUpgradeManagerController|NodeDataEngineUpgradeController)'

Length of output: 1831


Script:

#!/bin/bash
# Check how the controllers are started in the manager
rg -A 2 'go func\(\) {.*dataEngineUpgrade.*Controller\.Run\(' controller/controller_manager.go

Length of output: 243


Script:

#!/bin/bash
# Check how the controllers are started in the manager
rg "dataEngineUpgrade.*Controller" controller/controller_manager.go

Length of output: 268


Script:

#!/bin/bash
# Check how the node upgrade controller is started
rg "nodeDataEngine.*Controller" controller/controller_manager.go

Length of output: 256

k8s/pkg/apis/longhorn/v1beta2/volume.go (1)

358-359: 🛠️ Refactor suggestion

Add documentation and verify architectural consistency

The new status field needs:

  1. Documentation explaining its role in tracking the upgrade process
  2. Clarification of its relationship with spec.TargetNodeID
  3. Verification of consistency with other node-related status fields

Add documentation and consider renaming for clarity:

 	// +optional
+	// CurrentTargetNodeID indicates the node that is currently targeted for data engine live upgrade.
+	// This field reflects the actual state of the upgrade process initiated by spec.TargetNodeID.
 	CurrentTargetNodeID string `json:"currentTargetNodeID"`

Let's verify the usage of these new fields in the codebase:

datastore/datastore.go (4)

93-96: LGTM: New fields follow established naming patterns

The new fields for data engine upgrades are correctly structured and follow the existing field naming patterns in the DataStore struct.


186-189: LGTM: Proper informer initialization and cache sync

The initialization of new informers follows the established pattern:

  1. Informer creation from factory
  2. Cache sync registration

279-282: LGTM: Consistent lister and informer assignments

The assignments of listers and informers in the NewDataStore constructor follow the existing pattern and maintain consistency with other resource types.


Line range hint 1-324: Verify the impact on existing upgrade workflows

Since this change introduces new components for data engine upgrades, we should verify that it doesn't conflict with existing upgrade mechanisms.

✅ Verification successful

Let me analyze the upgrade-related files more deeply to understand potential conflicts.


Based on the extensive code analysis, I can now provide a final response regarding the impact of the data engine upgrade components on existing upgrade workflows.

No conflicts with existing upgrade mechanisms

The new data engine upgrade components (DataEngineUpgradeManager and NodeDataEngineUpgrade) are completely separate from the existing upgrade mechanisms and have been carefully designed to work alongside them:

  • The data engine upgrade is specifically for transitioning between data engine versions (v1 to v2) and operates independently from other upgrade workflows
  • The implementation includes proper safeguards through:
    • Dedicated controllers and monitors that handle only data engine upgrades
    • Clear state management and validation in webhooks
    • Node scheduling protection during upgrades
    • Volume attachment restrictions during the upgrade process

The code shows that the data engine upgrade process has been implemented as an entirely new workflow that doesn't interfere with or modify any of the existing upgrade code paths.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for existing upgrade-related code patterns
echo "Searching for existing upgrade patterns..."
rg -l "upgrade" --type go

# Look for potential conflicts in upgrade-related functions
echo "Checking for upgrade-related functions..."
ast-grep --pattern 'func $_($$) {
  $$$
  upgrade
  $$$
}'

Length of output: 9260


Script:

#!/bin/bash
# Check existing upgrade manager implementation
echo "Checking existing upgrade manager implementation..."
rg -A 10 "type UpgradeManager" --type go

# Check upgrade workflow in controllers
echo "Checking upgrade workflow in controllers..."
rg -A 10 "func.*[Uu]pgrade" controller/upgrade_manager_controller.go

# Look for data engine related upgrade code
echo "Looking for data engine upgrade related code..."
rg -A 5 "DataEngine.*[Uu]pgrade" --type go

Length of output: 100983

types/types.go (1)

163-164: LGTM!

The new label constants are well-defined and align with the data engine upgrade feature requirements.

Also applies to: 193-193

controller/uninstall_controller.go (2)

49-50: LGTM: Constants follow established naming pattern

The new CRD name constants are well-defined and consistent with the existing naming convention.


649-662: LGTM: Resource deletion methods follow established patterns

The new deletion methods for DataEngineUpgradeManager and NodeDataEngineUpgrade resources:

  • Follow consistent error handling patterns
  • Include appropriate logging
  • Handle "not found" cases correctly
  • Use the same deletion workflow as other resources

Also applies to: 1186-1228

k8s/crds.yaml (5)

1313-1408: Well-structured DataEngineUpgradeManager CRD definition

The new CRD for managing data engine upgrades is well-designed with:

  • Clear separation of upgrade management responsibilities
  • Proper status tracking fields
  • Informative printer columns for kubectl output

2358-2446: Well-structured NodeDataEngineUpgrade CRD definition

The new CRD for node-level data engine upgrades is well-designed with:

  • Clear node targeting mechanism
  • Volume-level status tracking
  • Proper integration with DataEngineUpgradeManager

2569-2571: Clear upgrade request field in Node CRD

The addition of dataEngineUpgradeRequested with a clear description helps control the upgrade process.


1785-1786: Consistent target node fields across Engine and Replica CRDs

The addition of target-related fields (targetNodeID, storageTargetIP, targetIP) is consistent between Engine and Replica CRDs, which is important for maintaining the relationship between these resources during upgrades.

Also applies to: 2004-2007, 3231-3232, 3295-3298


2105-2108: Improved InstanceManager observability

The addition of the image printer column improves observability by making the image information readily available in kubectl output.

webhook/resources/dataengineupgrademanager/validator.go (1)

41-53: Validation logic in Create method is appropriate

The Create method correctly validates that DataEngine is set to the supported type longhorn.DataEngineTypeV2. The validation ensures that only supported data engines are allowed during resource creation.

webhook/resources/nodedataengineupgrade/validator.go (5)

17-20: Struct Implementation is Correct

The nodeDataEngineUpgradeValidator struct correctly embeds admission.DefaultValidator and includes the necessary DataStore reference.


22-24: Validator Constructor Function is Well-Defined

The NewValidator function properly instantiates and returns a new validator instance with the provided DataStore.


26-38: Resource Metadata is Accurately Specified

The Resource method precisely defines the resource's metadata, including name, scope, API group, version, object type, and supported operation types.


40-66: Create Method Validates Required Fields Effectively

The Create method thoroughly checks for mandatory fields and validates their values, ensuring that:

  • Spec.NodeID is not empty.
  • Spec.DataEngine is of type DataEngineTypeV2.
  • Spec.InstanceManagerImage is provided.
  • Spec.DataEngineUpgradeManager is specified.

68-95: Update Method Enforces Field Immutability Appropriately

The Update method correctly ensures that critical fields remain immutable during updates by comparing the old and new objects:

  • Spec.NodeID
  • Spec.DataEngine
  • Spec.InstanceManagerImage
  • Spec.DataEngineUpgradeManager
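
Taken together, the two methods follow the usual webhook validator shape; a condensed, illustrative sketch (field names come from the summary above, the error helper is assumed):

// Create: required fields and supported engine type.
if upgrade.Spec.NodeID == "" {
    return werror.NewInvalidError("spec.nodeID is required", "spec.nodeID")
}
if upgrade.Spec.DataEngine != longhorn.DataEngineTypeV2 {
    return werror.NewInvalidError(fmt.Sprintf("data engine %v is not supported", upgrade.Spec.DataEngine), "spec.dataEngine")
}

// Update: immutable fields must not change.
if oldUpgrade.Spec.NodeID != newUpgrade.Spec.NodeID {
    return werror.NewInvalidError("spec.nodeID is immutable", "spec.nodeID")
}
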
controller/upgrade_manager_controller.go (2)

189-199: Handle potential conflict when updating OwnerID

When updating upgradeManager.Status.OwnerID, if a conflict occurs, the error is ignored with the assumption that "we don't mind others coming first." This might lead to unexpected behavior if multiple controllers contend for ownership.

Ensure that this behavior is acceptable in the context of the application. If not, consider handling the conflict more explicitly.
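
If explicit handling is preferred, one common pattern is to requeue on conflict rather than ignore it (the datastore and enqueue helper names below are assumptions):

upgradeManager.Status.OwnerID = uc.controllerID
if _, err := uc.ds.UpdateDataEngineUpgradeManagerStatus(upgradeManager); err != nil {
    if apierrors.IsConflict(errors.Cause(err)) {
        // Another controller claimed ownership first; requeue and re-evaluate.
        uc.enqueueDataEngineUpgradeManager(upgradeManager)
        return nil
    }
    return err
}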


71-71: ⚠️ Potential issue

Correct event source namespace

The event recorder is initialized with an empty namespace (""), which may cause events to be recorded in the wrong namespace.

Update the event source to use the controller's namespace:

eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{
    Interface: v1core.New(kubeClient.CoreV1().RESTClient()).Events(uc.namespace),
})

Likely invalid or redundant comment.

webhook/resources/volume/mutator.go (1)

211-218: Ensure consistent instance manager image selection during upgrades

In the Create method, when setting the defaultImage for DataEngineTypeV2 volumes, the code updates it based on the active instance manager image. This approach relies on the current state of instance managers, which may vary across the cluster.

To avoid potential inconsistencies:

  • Confirm that the selected activeInstanceManagerImage is compatible with the volume being created.
  • Consider adding validation to ensure that all non-default instance managers are using the same image before assigning it to defaultImage.

This helps maintain consistency and prevents issues during live upgrades.

Run a script along the following lines to check for multiple active instance manager images (the longhorn-system namespace and the spec.image field are assumptions):
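
#!/bin/bash
# Count instance managers per image; more than one output line indicates mixed images.
kubectl -n longhorn-system get instancemanagers.longhorn.io \
  -o jsonpath='{range .items[*]}{.spec.image}{"\n"}{end}' | sort | uniq -c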

This script counts the number of instance managers per image. Ideally, there should be only one active image in use.

controller/instance_handler_test.go (8)

Line range hint 39-49: Method signature updated for GetInstance

The GetInstance method now includes the isInstanceOnRemoteNode parameter to reflect interface changes. The mock implementation remains appropriate.


Line range hint 51-61: Method signature updated for CreateInstance

The CreateInstance method now includes the isInstanceOnRemoteNode parameter, aligning with interface updates. The mock implementation functions correctly.


75-77: Unimplemented method SuspendInstance returns indicative error

The SuspendInstance method returns an error indicating it is not mocked, which is acceptable for an unimplemented mock method.


79-81: Unimplemented method ResumeInstance returns indicative error

Similarly, the ResumeInstance method correctly returns an error indicating it is not mocked.


83-85: Unimplemented method SwitchOverTarget returns indicative error

This method returns an error to indicate it is not mocked, consistent with other unimplemented methods.


92-95: Implementation of IsEngine method is correct

The IsEngine method appropriately checks if the object is of type *longhorn.Engine.


101-104: Implementation of RequireRemoteTargetInstance method

The method returns false, nil, which is acceptable if no remote target instance is required in this mock context.


137-141: Additional fields in newEngine function initialized correctly

The TargetIP, StorageTargetIP, and TargetPort fields are appropriately set. Setting TargetPort to 0 for v1 volumes aligns with the comment and ensures backward compatibility.

webhook/resources/volume/validator.go (1)

388-393: Potential logic error when validating instance manager image

The validation checks at lines 388-393 may have a logical issue:

  • At line 388, if oldVolume.Spec.Image == instanceManagerImage, it returns an error stating the volume is already using the instance manager image.
  • Then it checks if newVolume.Spec.Image != instanceManagerImage and returns an error if it doesn't match.

This could prevent valid updates. Verify that this logic correctly enforces the intended constraints.

Please review the conditions to ensure they do not unintentionally block valid image updates.

controller/replica_controller.go (4)

Line range hint 316-354: Addition of isInstanceOnRemoteNode parameter in CreateInstance function

The CreateInstance function now includes the new parameter isInstanceOnRemoteNode, and it is correctly integrated into the function logic, particularly when calling GetInstanceManagerByInstanceRO. This enhancement aligns with updates in other controllers and improves instance management across remote nodes.


Line range hint 355-368: Correct usage of isInstanceOnRemoteNode in instance manager retrieval

The parameter isInstanceOnRemoteNode is appropriately passed to GetInstanceManagerByInstanceRO, ensuring that the correct instance manager is retrieved based on the instance’s node location.


631-634: Validation of instance type in IsEngine method is appropriate

The IsEngine method correctly checks if the provided object is of type *longhorn.Engine, which is logical for type assertions within the ReplicaController.


Line range hint 636-673: Integration of isInstanceOnRemoteNode parameter in GetInstance function

The GetInstance function has been updated to include the isInstanceOnRemoteNode parameter, and it is consistently used when retrieving the instance manager via GetInstanceManagerByInstanceRO. This change enhances the function’s ability to manage instances accurately based on their node location.

controller/monitor/node_upgrade_monitor.go (2)

58-64: Ensure thread-safe access to shared data to prevent race conditions

The NodeDataEngineUpgradeMonitor struct contains shared data like collectedData and nodeUpgradeStatus accessed by multiple goroutines. Although there are mutex locks (Lock and Unlock), ensure that all accesses to shared variables are properly synchronized to prevent data races.
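
For example, every read or write of the shared status should happen under the monitor's mutex (the constant name below is assumed):

m.Lock()
m.nodeUpgradeStatus.State = longhorn.UpgradeStateInitializing
m.Unlock()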


145-146: Review the initialization logic in handleUpgradeStateUndefined

The method handleUpgradeStateUndefined transitions the state to Initializing without additional logic. Confirm that no other initialization steps are required at this point.

engineapi/instance_manager.go (3)

890-891: Fix typo in error message: 'date' should be 'data'

The error message contains a typo: 'date engine' should be 'data engine'.


908-909: Fix typo in error message: 'date' should be 'data'

The error message contains a typo: 'date engine' should be 'data engine'.


924-925: Fix typo in error message: 'date' should be 'data'

The error message contains a typo: 'date engine' should be 'data engine'.

controller/instance_handler.go (1)

927-931: ⚠️ Potential issue

Add missing defer client.Close() to prevent resource leak

After the InstanceManagerClient is created at line 928, it should be closed when no longer needed; without a deferred client.Close() call, the underlying resources may leak.

A minimal fix is to defer the close immediately after the client is created:
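
// Immediately after the InstanceManagerClient is created at line 928:
defer client.Close()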

Likely invalid or redundant comment.

scheduler/replica_scheduler.go (1)

744-747: Verify the impact of adding the false parameter to GetInstanceManagerByInstanceRO

The method call GetInstanceManagerByInstanceRO(r, false) now includes an additional boolean parameter false. Please ensure that this change aligns with the expected behavior of the function. Specifically, verify that passing false does not unintentionally alter instance manager retrieval logic, especially in edge cases.

controller/node_controller.go (1)

2177-2190: Appropriate handling of node schedulability during data engine upgrades

The addition of the condition else if node.Spec.DataEngineUpgradeRequested correctly ensures that the node is marked as unschedulable when a data engine upgrade is in progress. This prevents new workloads from being scheduled on the node during the upgrade process, aligning with the intended functionality.

k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (6)

981-1007: DeepCopy Functions for DataEngineUpgradeManager

The DeepCopyInto, DeepCopy, and DeepCopyObject methods for DataEngineUpgradeManager are correctly implemented, ensuring proper deep copy functionality.


1010-1061: DeepCopy Functions for DataEngineUpgradeManagerList and Spec

The methods for DataEngineUpgradeManagerList and DataEngineUpgradeManagerSpec are properly generated, with accurate handling of slices and maps in the deep copy processes.


1783-1810: DeepCopy Functions for NodeDataEngineUpgrade

The DeepCopyInto, DeepCopy, and DeepCopyObject methods for NodeDataEngineUpgrade are appropriately implemented, facilitating correct deep copying of the object.


1812-1890: DeepCopy Functions for NodeDataEngineUpgradeList, Spec, and Status

The deep copy methods for NodeDataEngineUpgradeList, NodeDataEngineUpgradeSpec, and NodeDataEngineUpgradeStatus are accurately generated, ensuring all fields, including maps and slices, are correctly copied.


2991-3006: DeepCopy Functions for UpgradeNodeStatus

The DeepCopyInto and DeepCopy methods for UpgradeNodeStatus are correctly implemented, providing proper deep copy capabilities for this struct.


3326-3341: DeepCopy Functions for VolumeUpgradeStatus

The generated deep copy methods for VolumeUpgradeStatus are correctly implemented, ensuring that instances of this struct can be safely duplicated.

controller/engine_controller.go (1)

708-709: Typographical error in error message

The error message in SwitchOverTarget currently reads "invalid object for target switchover"; it would read more clearly as "invalid object provided for target switchover".

[typographical_issue]

Correct the error message:

 return fmt.Errorf("invalid object provided for target switchover: %v", obj)
datastore/longhorn.go (4)

Line range hint 1801-1805: Correct usage of GetInstanceManagerByInstanceRO with additional parameter

The code change correctly updates the function call to include the new parameter false.


3761-3764: Inconsistent handling of isInstanceOnRemoteNode for Replicas

The isInstanceOnRemoteNode parameter is considered when the object is a *longhorn.Engine, but not when it's a *longhorn.Replica. If this is intentional, please ensure it's documented. Otherwise, consider whether the logic should also apply to Replicas.


3995-3997: Add check for empty imageName parameter

Good addition of input validation to check for empty imageName.


5998-6005: Add new method IsNodeDataEngineUpgradeRequested

The implementation correctly retrieves the node and returns the DataEngineUpgradeRequested flag.

Comment on lines +55 to +56
dataengineupgrademanager.NewMutator(ds),
nodedataengineupgrade.NewMutator(ds),

💡 Codebase verification

Missing mutator implementations for new resources

The mutators are referenced in webhook/server/mutation.go, but the required implementations are missing:

  • webhook/resources/dataengineupgrademanager/mutator.go - file exists but lacks the Mutate method implementation
  • webhook/resources/nodedataengineupgrade/mutator.go - file exists but lacks the Mutate method implementation

These mutators need to be properly implemented following the existing patterns before they can be added to the mutation server.
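
As a rough sketch only, a Mutate implementation for DataEngineUpgradeManager modeled on the other resource mutators; the admission.Request/admission.PatchOps types, the werror helper, and common.GetLonghornFinalizerPatchOpIfNeeded are assumptions borrowed from that pattern and should be checked against an existing mutator (for example the volume one) before copying:

func (m *Mutator) Mutate(request *admission.Request, newObj runtime.Object) (admission.PatchOps, error) {
    upgradeManager, ok := newObj.(*longhorn.DataEngineUpgradeManager)
    if !ok {
        return nil, werror.NewInvalidError(fmt.Sprintf("%v is not a longhorn.DataEngineUpgradeManager", newObj), "")
    }

    var patchOps admission.PatchOps

    // Add the Longhorn finalizer so the controller can clean up on deletion.
    patchOp, err := common.GetLonghornFinalizerPatchOpIfNeeded(upgradeManager)
    if err != nil {
        return nil, werror.NewInvalidError(err.Error(), "")
    }
    if patchOp != "" {
        patchOps = append(patchOps, patchOp)
    }

    return patchOps, nil
}

The NodeDataEngineUpgrade mutator would follow the same shape with its own type assertion.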

🔗 Analysis chain

Verify mutator implementations for new resources.

The mutators are correctly added following the existing pattern. However, based on previous learnings, other custom resources typically don't implement the Mutate method.

Let's verify the implementation of these new mutators:


Let me try a different approach to verify the mutator implementations.


Let me try one final verification to check if these mutators exist in a different location or with a different structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the new resources implement Mutate method and compare with existing patterns

# Check for Mutate method implementation in new resources
echo "Checking DataEngineUpgradeManager mutator:"
ast-grep --pattern 'func (m *Mutator) Mutate($$$)' webhook/resources/dataengineupgrademanager/mutator.go

echo "Checking NodeDataEngineUpgrade mutator:"
ast-grep --pattern 'func (m *Mutator) Mutate($$$)' webhook/resources/nodedataengineupgrade/mutator.go

# Compare with existing patterns
echo "Checking other resource mutators for reference:"
fd mutator.go -t f -x ast-grep --pattern 'func (m *Mutator) Mutate($$$)' {}

Length of output: 579


Script:

#!/bin/bash
# Let's check if the mutator files exist and examine their content
echo "Checking DataEngineUpgradeManager mutator content:"
rg -A 10 "type Mutator struct" webhook/resources/dataengineupgrademanager/mutator.go || echo "File not found"

echo -e "\nChecking NodeDataEngineUpgrade mutator content:"
rg -A 10 "type Mutator struct" webhook/resources/nodedataengineupgrade/mutator.go || echo "File not found"

# Let's also check for any Mutate function in these files
echo -e "\nSearching for Mutate functions in DataEngineUpgradeManager:"
rg "func.*Mutate" webhook/resources/dataengineupgrademanager/mutator.go || echo "No Mutate function found"

echo -e "\nSearching for Mutate functions in NodeDataEngineUpgrade:"
rg "func.*Mutate" webhook/resources/nodedataengineupgrade/mutator.go || echo "No Mutate function found"

Length of output: 997


Script:

#!/bin/bash
# Search for any files related to these mutators
echo "Searching for DataEngineUpgradeManager mutator files:"
fd -t f -e go dataengineupgrademanager

echo -e "\nSearching for NodeDataEngineUpgrade mutator files:"
fd -t f -e go nodedataengineupgrade

# Search for any imports or references to these mutators
echo -e "\nSearching for references to these mutators:"
rg -l "dataengineupgrademanager|nodedataengineupgrade" --type go

Length of output: 2071

Comment on lines +1050 to +1081
if updated {
for replicaName := range engine.Status.ReplicaModeMap {
replica, replicaErr := m.ds.GetReplicaRO(replicaName)
if replicaErr != nil {
m.logger.WithError(replicaErr).Errorf("Failed to get replica %v", replicaName)
continue
}
if replica.Spec.NodeID == engine.Spec.NodeID &&
replica.Status.CurrentState != longhorn.InstanceStateError &&
replica.Status.CurrentState != longhorn.InstanceStateStopped {
m.logger.Warnf("Replica %v in state %v is still on the node %v, will retry updating status later",
replicaName, replica.Status.CurrentState, replica.Spec.NodeID)
return false
}
}
} else {
m.logger.Warnf("v2 data engine %v is being upgraded, will retry updating status later", engine.Name)
return false
}
if updatedCheckErr != nil {
upgradingCheckErr = errors.Wrapf(updatedCheckErr, "failed to check if the instance manager is updated")
}
}
if upgradingCheckErr != nil {
err = errors.Wrapf(upgradingCheckErr, "failed to check if the engine %v is being upgraded", engine.Name)
}
}
// Retry if the error is due to conflict
utilruntime.HandleError(errors.Wrapf(err, "failed to update status for engine %v", m.Name))
break

⚠️ Potential issue

Potential infinite loop in engine monitor

In the sync method of EngineMonitor, if the error at line 1041 is neither a conflict nor a not-found error, the loop might continue indefinitely without progressing. This could lead to resource consumption issues.

Consider breaking the loop or adding a return statement after handling the error:

 if err != nil && apierrors.IsConflict(errors.Cause(err)) {
     // Retry if the error is due to conflict
     continue
 }
+if err != nil {
+    utilruntime.HandleError(errors.Wrapf(err, "failed to update status for engine %v", m.Name))
+    return false
+}

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (6)
controller/instance_handler.go (3)

89-133: Improve error handling in target instance management

The error handling in this block could be improved by:

  1. Consolidating similar error handling patterns
  2. Adding more context to error messages
  3. Considering retry mechanisms for transient failures (see the sketch below)
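
A small, self-contained sketch of the wrap-and-retry idea; the retry count and delay are illustrative values, not taken from Longhorn:

package main

import (
    "errors"
    "fmt"
    "time"
)

// withRetry retries a transient operation a few times and wraps the final
// error with context describing what was being attempted.
func withRetry(desc string, attempts int, delay time.Duration, fn func() error) error {
    var lastErr error
    for i := 0; i < attempts; i++ {
        if lastErr = fn(); lastErr == nil {
            return nil
        }
        time.Sleep(delay)
    }
    return fmt.Errorf("%s failed after %d attempts: %w", desc, attempts, lastErr)
}

func main() {
    calls := 0
    err := withRetry("get target instance", 3, 10*time.Millisecond, func() error {
        calls++
        if calls < 3 {
            return errors.New("transient network error")
        }
        return nil
    })
    fmt.Println("err:", err, "calls:", calls)
}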

716-772: Add documentation for complex instance creation logic

The instance creation logic for v2 data engine is complex and would benefit from detailed documentation explaining:

  1. The different states and transitions
  2. The conditions for target instance creation
  3. The relationship between local and remote instances

955-983: Improve error handling and logging in target instance verification

The target instance verification logic could be improved by:

  1. Adding debug logs for better troubleshooting
  2. Providing more context in error messages
  3. Adding retries for transient network issues

Example improvement:

 if err != nil {
-    return false, errors.Wrapf(err, "failed to get target instance %v on node %v", instanceName, spec.TargetNodeID)
+    return false, errors.Wrapf(err, "failed to get target instance %v on node %v: %v", instanceName, spec.TargetNodeID, err)
 }
+logrus.Debugf("Target instance %v state on node %v: remote=%v, replacement=%v",
+    instanceName, spec.TargetNodeID, isTargetInstanceRemote(instance), isTargetInstanceReplacementCreated(instance))
controller/engine_controller.go (3)

437-467: Improve error handling for target instance lookup

The error handling for target instance lookup could be improved to be more specific and informative.

Consider enhancing the error messages to include more context:

 if e.Spec.TargetNodeID != "" {
     targetIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, true)
     if err != nil {
-        return nil, "", "", err
+        return nil, "", "", errors.Wrapf(err, "failed to get target instance manager for node %v", e.Spec.TargetNodeID)
     }

2419-2465: Refactor duplicate instance manager validation logic

The instance manager validation logic is duplicated for both initiator and target instances. Consider extracting this into a helper method.

Consider refactoring the duplicate validation logic:

+func (ec *EngineController) validateInstanceManager(nodeID string, instanceName string) (*longhorn.InstanceManager, error) {
+    im, err := ec.ds.GetRunningInstanceManagerByNodeRO(nodeID, longhorn.DataEngineTypeV2)
+    if err != nil {
+        return nil, err
+    }
+    if im.Status.CurrentState != longhorn.InstanceManagerStateRunning {
+        return nil, fmt.Errorf("instance manager %v for instance %v is not running", im.Name, instanceName)
+    }
+    
+    _, ok := im.Status.InstanceEngines[instanceName]
+    if !ok {
+        return nil, fmt.Errorf("instance %v is not found in engine list", instanceName)
+    }
+    
+    return im, nil
+}

Then use this helper method:

-im, err := ec.ds.GetRunningInstanceManagerByNodeRO(e.Spec.NodeID, longhorn.DataEngineTypeV2)
-if err != nil {
-    return err
-}
-if im.Status.CurrentState != longhorn.InstanceManagerStateRunning {
-    return fmt.Errorf("instance manager %v for initiating instance %v is not running", im.Name, e.Name)
-}
+im, err := ec.validateInstanceManager(e.Spec.NodeID, e.Name)
+if err != nil {
+    return errors.Wrapf(err, "failed to validate initiator instance manager")
+}

704-760: Enhance logging in SwitchOverTarget

While the method has good error handling, it could benefit from additional logging to help with troubleshooting.

Add more detailed logging:

 func (ec *EngineController) SwitchOverTarget(obj interface{}) error {
+    log := getLoggerForEngine(ec.logger, e)
+    log.Info("Starting target switchover")
+    defer func() {
+        if err != nil {
+            log.WithError(err).Error("Failed to switch over target")
+        }
+    }()
     
     // ... existing code ...
     
     targetIM, err := ec.getTargetInstanceManagerForSwitchOver(e)
     if err != nil {
         return err
     }
+    log.WithField("targetIM", targetIM.Name).Info("Found target instance manager")

     // ... rest of the code ...
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 393807e and 0897343.

📒 Files selected for processing (6)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controller/volume_controller_test.go
🧰 Additional context used
🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (10)
controller/instance_handler_test.go (3)

Line range hint 39-51: LGTM! Method signature updates are consistent with the interface changes.

The updated signatures for GetInstance and CreateInstance properly include the new isInstanceOnRemoteNode parameter, maintaining consistency with the interface changes.


87-90: ⚠️ Potential issue

Fix inconsistent error handling in DeleteTarget.

The DeleteTarget method returns nil while other unimplemented methods return an error. This inconsistency was previously flagged and should be addressed.

Apply this diff to ensure consistent error handling:

func (imh *MockInstanceManagerHandler) DeleteTarget(obj interface{}) error {
    // DeleteTarget is not mocked
-   return nil
+   return fmt.Errorf("DeleteTarget is not mocked")
}

137-141: Verify test coverage for new engine status fields.

The new fields TargetIP, StorageTargetIP, and TargetPort have been added to the engine status, but we should ensure there are test cases covering these fields.

✅ Verification successful

Let me gather more context about the test coverage by checking the test cases in more detail.


Based on the test file content, I can now generate the final response:

Test coverage for engine status fields is sufficient

The test file instance_handler_test.go includes comprehensive test cases that verify the new engine status fields:

  • The fields TargetIP, StorageTargetIP, and TargetPort are properly initialized and verified in the newEngine helper function
  • These fields are tested across multiple scenarios in TestReconcileInstanceState, including:
    • Engine state transitions (stopped → running, running → stopped)
    • Different node conditions
    • Various instance manager states
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test cases that verify the new status fields
rg -A 5 "TargetIP|StorageTargetIP|TargetPort" --type go --glob "*_test.go"

Length of output: 1360


Script:

#!/bin/bash
# Look for test functions that use these fields
ast-grep --pattern 'func Test$_($_) {
  $$$
  TargetIP: $$$
  $$$
}'

# Also check the full test file content to understand the test coverage
cat controller/instance_handler_test.go

Length of output: 23964

controller/instance_handler.go (2)

38-46: LGTM: Well-structured interface changes for v2 data engine support

The new methods added to InstanceManagerHandler interface provide a clean API for managing v2 data engine live upgrades.


113-113: ⚠️ Potential issue

Fix incorrect log message

The log message incorrectly mentions "Target IP" when it should be "Target Port".

-      logrus.Infof("Instance %v starts running, Target IP %v", instanceName, status.TargetIP)
+      logrus.Infof("Instance %v starts running, Target Port %v", instanceName, status.TargetPort)

Likely invalid or redundant comment.

controller/volume_controller.go (5)

1007-1012: LGTM: V2 data engine image handling

The code correctly handles the case where replica engine image can be different from volume engine image for v2 volumes.


2239-2248: LGTM: Skip replica replenishment during node upgrade

The code correctly skips replica replenishment when the node is being upgraded, preventing potential issues during the upgrade process.


3306-3320: LGTM: Volume eligibility check for live upgrade

The function properly handles eligibility checks for both v1 and v2 data engines:

  • V1: Requires volume to be healthy
  • V2: Allows both healthy and degraded states

3322-3327: LGTM: V2 data engine upgrade completion check

The completion check verifies all required conditions (condensed into a sketch after this list):

  • Target node ID is set
  • Image matches current image
  • Target node matches current target node
  • Node matches current node
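
A condensed sketch of that boolean check; the volumeState struct and its field names are hypothetical stand-ins for the spec/status fields the controller actually compares:

// isV2UpgradeCompleted mirrors the four conditions listed above.
type volumeState struct {
    TargetNodeID        string
    Image               string
    CurrentImage        string
    NodeID              string
    CurrentNodeID       string
    CurrentTargetNodeID string
}

func isV2UpgradeCompleted(v volumeState) bool {
    return v.TargetNodeID != "" &&
        v.Image == v.CurrentImage &&
        v.TargetNodeID == v.CurrentTargetNodeID &&
        v.NodeID == v.CurrentNodeID
}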

5086-5094: LGTM: Volume upgrade status check

The function correctly handles both v1 and v2 data engine upgrade status checks:

  • V1: Only checks image update
  • V2: Checks both image and target node updates

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (6)
controller/instance_handler.go (2)

58-165: Refactor logging messages for consistency

The function contains similar log messages with slightly different formats. Consider standardizing them:

-logrus.Infof("Instance %v starts running, Storage IP %v", instanceName, status.StorageIP)
-logrus.Infof("Instance %v starts running, IP %v", instanceName, status.IP)
-logrus.Infof("Instance %v starts running, Port %d", instanceName, status.Port)
+logrus.Infof("Instance %v: updated %s to %v", instanceName, "Storage IP", status.StorageIP)
+logrus.Infof("Instance %v: updated %s to %v", instanceName, "IP", status.IP)
+logrus.Infof("Instance %v: updated %s to %v", instanceName, "Port", status.Port)

883-995: Add documentation for helper functions

These helper functions would benefit from documentation explaining their purpose and behavior:

  • isVolumeBeingSwitchedBack
  • isTargetInstanceReplacementCreated
  • isTargetInstanceRemote
  • isDataEngineNotBeingLiveUpgraded

Example:

// isVolumeBeingSwitchedBack returns true if the volume is being switched back to its original node,
// which occurs when the target node matches the current node but differs from the current target node.
func isVolumeBeingSwitchedBack(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) bool {
controller/engine_controller.go (3)

437-467: LGTM with a minor suggestion for error handling

The implementation correctly handles both initiator and target instance managers. Consider adding error context when returning errors from GetInstanceManagerByInstanceRO to help with debugging.

-            return nil, "", "", err
+            return nil, "", "", errors.Wrapf(err, "failed to get instance manager for initiator")

Line range hint 2419-2476: Comprehensive upgrade implementation for v2 data engine

The implementation properly validates both initiator and target instance managers before proceeding with the upgrade. However, consider adding a check for concurrent upgrades to prevent potential race conditions.

Consider implementing a mutex or other synchronization mechanism to prevent concurrent upgrades of the same engine.
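
One way to do that, sketched as a generic per-engine keyed mutex; this type is not part of Longhorn and only illustrates serializing upgrades per engine while letting different engines proceed in parallel:

package main

import (
    "fmt"
    "sync"
)

// keyedMutex hands out one mutex per key, e.g. per engine name.
type keyedMutex struct {
    mu    sync.Mutex
    locks map[string]*sync.Mutex
}

func newKeyedMutex() *keyedMutex {
    return &keyedMutex{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex associated with key, creating it on first use.
func (k *keyedMutex) lockFor(key string) *sync.Mutex {
    k.mu.Lock()
    defer k.mu.Unlock()
    if _, ok := k.locks[key]; !ok {
        k.locks[key] = &sync.Mutex{}
    }
    return k.locks[key]
}

func main() {
    upgradeLocks := newKeyedMutex()
    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            l := upgradeLocks.lockFor("engine-a")
            l.Lock()
            defer l.Unlock()
            fmt.Println("upgrading engine-a, attempt", i)
        }(i)
    }
    wg.Wait()
}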


1041-1078: Consider structured error types for better error handling

The error handling is comprehensive but could benefit from defined error types for specific failure scenarios.

Consider creating custom error types for common failure scenarios:

type EngineUpgradeError struct {
    Reason string
    Details error
}

func (e *EngineUpgradeError) Error() string {
    return fmt.Sprintf("engine upgrade failed: %s: %v", e.Reason, e.Details)
}
controller/volume_controller.go (1)

1007-1012: Improve comment clarity for v2 volume replica image check

The comment could be more descriptive about why v2 volume replicas can have different images.

-				// For a v2 volume, the instance manager image of a replica can be different from the one of its volume
+				// For a v2 volume, replicas use the instance manager image which can differ from the volume's engine image
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0897343 and 743a76d.

📒 Files selected for processing (6)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controller/instance_handler_test.go
🧰 Additional context used
📓 Learnings (1)
controller/engine_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/engine_controller.go:524-527
Timestamp: 2024-11-25T12:39:58.926Z
Learning: In `controller/engine_controller.go`, `e.Status.Port` is sourced from the SPDK engine and does not require additional validation.
🪛 golangci-lint (1.61.0)
controller/volume_controller.go

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (9)
controller/instance_handler.go (3)

38-46: LGTM: Interface changes align with v2 data engine requirements

The new methods added to InstanceManagerHandler interface provide a comprehensive set of operations needed for v2 data engine live upgrades, including instance suspension, resumption, and target management.


224-227: Resolve error handling uncertainty

The TODO comment about error handling when checking node upgrade status should be addressed.


38-39: Verify usage of isInstanceOnRemoteNode parameter

The addition of isInstanceOnRemoteNode parameter to GetInstance and CreateInstance methods requires verification of all callers.

✅ Verification successful

Based on the search results, I can now generate the final response:

All callers properly handle the isInstanceOnRemoteNode parameter

The codebase analysis shows that all calls to GetInstance and CreateInstance methods properly handle the new boolean parameter:

  • Engine controller correctly uses false for initiator instances and true for target instances during switchover
  • Instance handler consistently uses false for local instances and true for remote instances
  • All implementations (EngineController, ReplicaController, MockInstanceManagerHandler) have been updated with the parameter
  • No instances of incorrect parameter usage were found
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for all calls to GetInstance and CreateInstance to verify proper parameter usage
rg -A 2 'GetInstance\(' --type go
rg -A 2 'CreateInstance\(' --type go

Length of output: 6021

controller/volume_controller_test.go (1)

506-510: ⚠️ Potential issue

Verify field definitions and usage consistency

The test is setting new engine status fields (TargetIP, StorageTargetIP, TargetPort) to empty values during volume detachment. However, based on past review comments, these fields may not be properly defined in the EngineStatus struct.

If these fields are not properly defined in the EngineStatus struct, please:

  1. Add the field definitions to the struct
  2. Update all relevant controllers and handlers to properly manage these fields
  3. Document the purpose and lifecycle of these fields
✅ Verification successful

Based on the search results, I can see that the fields TargetIP, StorageIP, StorageTargetIP, Port, and TargetPort are properly defined in the Longhorn API types at k8s/pkg/apis/longhorn/v1beta2/instancemanager.go:

TargetIP string `json:"targetIP"`
StorageTargetIP string `json:"storageTargetIP"`
TargetPort int `json:"targetPort"`

The fields are also actively used throughout the codebase, particularly in:

  1. controller/instance_handler.go - For managing instance states and IP/port assignments
  2. controller/engine_controller.go - For handling target switchover operations
  3. engineapi/instance_manager.go - For instance manager operations

The test code in controller/volume_controller_test.go is correctly resetting these fields during volume detachment, which is consistent with the codebase's behavior.

The fields are properly defined and used consistently across the codebase

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and proper usage of new engine status fields

# Check if these fields are defined in the EngineStatus struct
ast-grep --pattern $'type EngineStatus struct {
  $$$
  TargetIP string
  StorageTargetIP string
  TargetPort int
  $$$
}'

# Check for any other usages of these fields in the codebase
rg -A 2 'TargetIP|StorageTargetIP|TargetPort' --type go

Length of output: 31246

controller/engine_controller.go (2)

646-673: LGTM! Well-structured suspend implementation

The suspend implementation has proper validation and error handling.


704-760: Verify port handling during target switchover

The implementation looks good but the port selection logic needs verification:

  1. When targetInstance.Status.TargetPortStart is 0
  2. When both port options are 0
✅ Verification successful

Let me gather more information about the port handling logic.


Let me try a different approach to understand the port handling logic.


Let me check one more aspect of port handling, specifically focusing on the instance manager's port allocation and validation.


Based on the code analysis, I can now provide a final response about the port handling during target switchover.

Port handling during target switchover is properly implemented

The implementation has proper safeguards:

  • The code checks for zero ports and handles them appropriately in SwitchOverTarget
  • When targetInstance.Status.StandbyTargetPortStart is non-zero, it's preferred over TargetPortStart
  • The system validates port availability before operations
  • There are explicit validations preventing operations with invalid port configurations (e.g., IP present but port = 0)

The port selection logic in the switchover implementation is safe and follows the established patterns in the codebase for port handling.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential port conflicts or zero port assignments
# Test: Search for port assignments in engine instances
rg -A 3 "Status.TargetPortStart.*=.*0" 

Length of output: 652


Script:

#!/bin/bash
# Check port initialization and assignment logic
ast-grep --pattern 'Status.TargetPortStart = $_'

# Check port validation or error handling
rg -B 2 -A 2 "TargetPortStart.*0.*error"

# Check instance manager port allocation
ast-grep --pattern 'func ($_) AllocatePort($$$)'

Length of output: 144


Script:

#!/bin/bash
# Check how ports are handled in instance manager
rg -A 5 "func.*AllocatePort"

# Look for port initialization in engine instance
rg -B 2 -A 5 "PortStart.*:=|TargetPortStart.*:="

# Check for port validation
rg -B 2 -A 3 "if.*Port.*== 0"

Length of output: 7486


Script:

#!/bin/bash
# Check instance manager port initialization and validation
rg -B 3 -A 5 "instance.*Status.*Port.*Start.*="

# Check error handling for port allocation
rg -B 2 -A 3 "failed.*allocate.*port"

# Look for port validation in engine instance creation
rg -B 3 -A 5 "CreateEngine.*port"

Length of output: 1698

controller/volume_controller.go (3)

3817-3822: LGTM: V2 replica image handling

The code correctly handles the image setting for v2 replicas by allowing it to be empty and different from the volume image.


5086-5094: LGTM: Volume upgrade check logic

The code properly handles upgrade checks for both v1 and v2 data engines by considering both image updates and target node changes.


1827-1836: Verify engine version before starting replica

The code correctly handles different replica startup logic for v1 and v2 engines, but should verify engine existence first.

✅ Verification successful

Let me gather more context about the code surrounding the replica startup logic to verify if engine existence is properly checked.


Let me check the function containing the replica startup logic to verify if engine existence is checked before this code block.


Let me try a different approach to find the function containing this code.


Let me check the openVolumeDependentResources function signature and its usage to verify if engine existence is properly checked.


Based on the gathered evidence, I can now provide the final response:

Engine existence check is properly handled

The code is safe as the openVolumeDependentResources function is only called in volume state transitions where the engine existence is already verified. The function is called in three scenarios:

  • During volume attachment (VolumeStateDetached → VolumeStateAttaching)
  • During volume attached state to handle new replicas
  • During volume migration

All these calls occur after engine validation in the volume controller reconciliation loop.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify engine existence check before replica startup
ast-grep --pattern 'if e == nil {
  $$$
  return $_ 
}'

Length of output: 9416


Script:

#!/bin/bash
# Get context around the replica startup code
rg -B 10 -A 10 "if r.Spec.FailedAt == \"\"" controller/volume_controller.go

Length of output: 6095


Script:

#!/bin/bash
# Get the entire function containing the replica startup logic
ast-grep --pattern 'func $_(c *VolumeController, $_, $_, $_ *longhorn.Engine) error {
  $$$
  if r.Spec.FailedAt == "" {
    $$$
  }
  $$$
}'

Length of output: 143


Script:

#!/bin/bash
# Find the function containing this code by searching for unique strings around it
rg -B 30 "canInstanceManagerLaunchReplica" controller/volume_controller.go

Length of output: 2311


Script:

#!/bin/bash
# Find where openVolumeDependentResources is called
rg "openVolumeDependentResources" controller/volume_controller.go -B 5 -A 5

Length of output: 2446

@derekbit derekbit force-pushed the v2-control-upgrade branch 4 times, most recently from 82fdc88 to 00e7be5 on November 25, 2024 14:23
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Outside diff range and nitpick comments (21)
k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (4)

5-15: Consider enhancing field validation and documentation

The spec structure could benefit from additional validation and documentation:

  1. The Nodes field could use validation to ensure node names are valid:
 // Nodes specifies the list of nodes to perform the data engine upgrade on.
 // If empty, the upgrade will be performed on all available nodes.
 // +optional
+// +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
 Nodes []string `json:"nodes"`
  2. Consider adding more detailed documentation about:
    • The upgrade process flow
    • Prerequisites for upgrade
    • Impact on running workloads

17-23: Document possible states and message format

Consider adding documentation to clarify:

  1. The possible values for UpgradeState
  2. The expected format and content of the Message field

Example:

 // UpgradeNodeStatus defines the state of the node upgrade process
 type UpgradeNodeStatus struct {
+	// State represents the current state of the upgrade process.
+	// Possible values: "Pending", "InProgress", "Completed", "Failed"
 	// +optional
 	State UpgradeState `json:"state"`
+	// Message provides detailed information about the current state,
+	// including any error details if the state is "Failed"
 	// +optional
 	Message string `json:"message"`
 }

25-39: Add field validation and clarify status transitions

Consider enhancing the status structure with:

  1. Validation for InstanceManagerImage:
 // +optional
+// +kubebuilder:validation:Pattern=`^[^:]+:[^:]+$`
 InstanceManagerImage string `json:"instanceManagerImage"`
  2. Documentation about:
    • The relationship between UpgradingNode and UpgradeNodes map
    • How the OwnerID is determined and its significance
    • Status transition flow between different states

41-57: Consider adding more printer columns for better observability

The current printer columns are good, but consider adding:

  1. Age column to track resource lifetime
  2. Message column for quick status checks

Example:

 // +kubebuilder:printcolumn:name="Upgrading Node",type=string,JSONPath=`.status.upgradingNode`,description="The node that is currently being upgraded"
+// +kubebuilder:printcolumn:name="Message",type=string,JSONPath=`.status.message`,description="The current status message"
+// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (2)

95-96: Consider adding node ID validation

The new TargetNodeID field should validate that the specified node exists and is ready to receive the instance during live upgrade.

Consider adding a validation rule similar to:

 // +optional
+// +kubebuilder:validation:Pattern=^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$
 TargetNodeID string `json:"targetNodeID"`

115-119: LGTM: Comprehensive status tracking for live upgrades

The new status fields provide good observability for the live upgrade process. The separation of current and target information allows for proper tracking of the upgrade state.

Consider implementing a status condition type specifically for upgrade progress to provide a more standardized way to track the upgrade state. This would align with Kubernetes patterns and make it easier to integrate with tools like kubectl wait.

Also applies to: 129-132
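
A small sketch of what such a condition could look like using the upstream metav1.Condition helpers; Longhorn's own Condition type differs slightly, so the field names would need adapting:

package conditions

import (
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ConditionTypeUpgrading is a hypothetical condition name for live-upgrade progress.
const ConditionTypeUpgrading = "Upgrading"

// SetUpgradingCondition flips the Upgrading condition as the upgrade starts and
// finishes, which lets tools such as
// `kubectl wait --for=condition=Upgrading=false` track progress.
func SetUpgradingCondition(conds *[]metav1.Condition, generation int64, inProgress bool) {
    status := metav1.ConditionFalse
    reason := "UpgradeCompleted"
    if inProgress {
        status = metav1.ConditionTrue
        reason = "UpgradeInProgress"
    }
    meta.SetStatusCondition(conds, metav1.Condition{
        Type:               ConditionTypeUpgrading,
        Status:             status,
        Reason:             reason,
        ObservedGeneration: generation,
    })
}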

controller/node_upgrade_controller.go (3)

57-57: Track TODO comment with an issue

The TODO comment about removing the wrapper should be tracked with a GitHub issue for future follow-up.

Would you like me to create a GitHub issue to track this TODO?


235-244: Consider consolidating monitor cleanup logic

The monitor cleanup logic is duplicated between the deletion case and the completion/error case. Consider extracting this into a helper method to maintain DRY principles.

+func (uc *NodeDataEngineUpgradeController) cleanupMonitor() {
+    if uc.nodeDataEngineUpgradeMonitor != nil {
+        uc.nodeDataEngineUpgradeMonitor.Close()
+        uc.nodeDataEngineUpgradeMonitor = nil
+    }
+}

 func (uc *NodeDataEngineUpgradeController) reconcile(upgradeName string) (err error) {
     // ...
     if !nodeUpgrade.DeletionTimestamp.IsZero() {
-        if uc.nodeDataEngineUpgradeMonitor != nil {
-            uc.nodeDataEngineUpgradeMonitor.Close()
-            uc.nodeDataEngineUpgradeMonitor = nil
-        }
+        uc.cleanupMonitor()
         return uc.ds.RemoveFinalizerForNodeDataEngineUpgrade(nodeUpgrade)
     }
     // ...
     if nodeUpgrade.Status.State == longhorn.UpgradeStateCompleted ||
         nodeUpgrade.Status.State == longhorn.UpgradeStateError {
         uc.updateNodeDataEngineUpgradeStatus(nodeUpgrade)
-        uc.nodeDataEngineUpgradeMonitor.Close()
-        uc.nodeDataEngineUpgradeMonitor = nil
+        uc.cleanupMonitor()
     }

259-267: Add validation for status fields

The status update logic should validate the fields before assignment to prevent potential issues with invalid states or messages.

+func isValidUpgradeState(state longhorn.UpgradeState) bool {
+    switch state {
+    case longhorn.UpgradeStateInProgress,
+         longhorn.UpgradeStateCompleted,
+         longhorn.UpgradeStateError:
+        return true
+    }
+    return false
+}

 func (uc *NodeDataEngineUpgradeController) updateNodeDataEngineUpgradeStatus(nodeUpgrade *longhorn.NodeDataEngineUpgrade) {
     // ...
+    if !isValidUpgradeState(status.State) {
+        log.Errorf("Invalid upgrade state: %v", status.State)
+        return
+    }
     nodeUpgrade.Status.State = status.State
     nodeUpgrade.Status.Message = status.Message
datastore/datastore.go (1)

93-96: Consider reordering fields alphabetically

The new fields dataEngineUpgradeManagerLister and nodeDataEngineUpgradeLister should ideally be placed in alphabetical order within the struct to maintain consistency with other fields.

Apply this reordering:

-	dataEngineUpgradeManagerLister   lhlisters.DataEngineUpgradeManagerLister
-	DataEngineUpgradeManagerInformer cache.SharedInformer
-	nodeDataEngineUpgradeLister      lhlisters.NodeDataEngineUpgradeLister
-	NodeDataEngineUpgradeInformer    cache.SharedInformer
+	dataEngineUpgradeManagerLister   lhlisters.DataEngineUpgradeManagerLister
+	DataEngineUpgradeManagerInformer cache.SharedInformer
+	engineImageLister                lhlisters.EngineImageLister
+	EngineImageInformer              cache.SharedInformer
+	engineLister                     lhlisters.EngineLister
+	EngineInformer                   cache.SharedInformer
+	nodeDataEngineUpgradeLister      lhlisters.NodeDataEngineUpgradeLister
+	NodeDataEngineUpgradeInformer    cache.SharedInformer
controller/instance_handler.go (1)

58-165: Consider refactoring to reduce complexity

The syncStatusIPsAndPorts function has deep nesting and repeated error-handling patterns. Consider breaking it down into smaller functions, as sketched after this list:

  1. syncBasicInstanceStatus for basic IP/port sync
  2. syncTargetInstanceStatus for v2 data engine target instance sync
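
A skeleton of that split; the receivers, parameter lists, and helper names are assumptions based on the handler code referenced above, and the helper bodies are placeholders for the existing logic:

// syncStatusIPsAndPorts would then only dispatch to the two helpers.
func (h *InstanceHandler) syncStatusIPsAndPorts(im *longhorn.InstanceManager, spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus, instance longhorn.InstanceProcess) {
    h.syncBasicInstanceStatus(im, status, instance)
    if types.IsDataEngineV2(spec.DataEngine) {
        h.syncTargetInstanceStatus(im, spec, status, instance)
    }
}

// syncBasicInstanceStatus: move the existing IP, StorageIP, and Port handling here.
func (h *InstanceHandler) syncBasicInstanceStatus(im *longhorn.InstanceManager, status *longhorn.InstanceStatus, instance longhorn.InstanceProcess) {
}

// syncTargetInstanceStatus: move the existing TargetIP, StorageTargetIP, and
// TargetPort handling for the v2 data engine here.
func (h *InstanceHandler) syncTargetInstanceStatus(im *longhorn.InstanceManager, spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus, instance longhorn.InstanceProcess) {
}
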
controller/volume_controller_test.go (1)

Line range hint 25-1000: Consider improving test organization and documentation

While the test cases are comprehensive, consider:

  1. Organizing test cases into logical groups using subtests
  2. Adding comments to explain complex test scenarios
  3. Using table-driven tests for similar scenarios

Example refactor:

 func (s *TestSuite) TestVolumeLifeCycle(c *C) {
+    // Group test cases by lifecycle phase
+    t.Run("Creation", func(t *testing.T) {
+        // Volume creation test cases
+    })
+    t.Run("Attachment", func(t *testing.T) {
+        // Volume attachment test cases
+    })
controller/engine_controller.go (5)

437-467: LGTM! Consider enhancing error handling for edge cases.

The method effectively retrieves instance manager and IPs for both initiator and target instances. The implementation is clean and well-structured.

Consider adding validation for empty IP addresses and handling the case where instance manager exists but has no IP:

 func (ec *EngineController) findInstanceManagerAndIPs(obj interface{}) (im *longhorn.InstanceManager, initiatorIP string, targetIP string, err error) {
     // ... existing code ...
     
     initiatorIP = initiatorIM.Status.IP
+    if initiatorIP == "" {
+        return nil, "", "", fmt.Errorf("initiator instance manager %v has no IP", initiatorIM.Name)
+    }
     targetIP = initiatorIM.Status.IP
     im = initiatorIM
     
     // ... existing code ...
     
     if e.Spec.TargetNodeID != "" {
         // ... existing code ...
         targetIP = targetIM.Status.IP
+        if targetIP == "" {
+            return nil, "", "", fmt.Errorf("target instance manager %v has no IP", targetIM.Name)
+        }
     }

2419-2465: LGTM! Consider enhancing logging for better debugging.

The implementation thoroughly validates both initiator and target instances before proceeding with the v2 data engine upgrade.

Consider adding structured logging to help with debugging upgrade issues:

 // Check if the initiator instance is running
 im, err := ec.ds.GetRunningInstanceManagerByNodeRO(e.Spec.NodeID, longhorn.DataEngineTypeV2)
 if err != nil {
+    log.WithError(err).WithFields(logrus.Fields{
+        "node": e.Spec.NodeID,
+        "engine": e.Name,
+    }).Error("Failed to get running instance manager for initiator")
     return err
 }

646-702: Consider extracting common validation logic.

Both SuspendInstance and ResumeInstance share similar validation patterns that could be extracted into a helper function.

Consider refactoring the common validation logic:

+func (ec *EngineController) validateEngineInstanceOp(e *longhorn.Engine, op string) error {
+    if !types.IsDataEngineV2(e.Spec.DataEngine) {
+        return fmt.Errorf("%v engine instance is not supported for data engine %v", op, e.Spec.DataEngine)
+    }
+    if e.Spec.VolumeName == "" || e.Spec.NodeID == "" {
+        return fmt.Errorf("missing parameters for engine instance %v: %+v", op, e)
+    }
+    return nil
+}

 func (ec *EngineController) SuspendInstance(obj interface{}) error {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
         return fmt.Errorf("invalid object for engine instance suspension: %v", obj)
     }
-    if !types.IsDataEngineV2(e.Spec.DataEngine) {
-        return fmt.Errorf("suspending engine instance is not supported for data engine %v", e.Spec.DataEngine)
-    }
-    if e.Spec.VolumeName == "" || e.Spec.NodeID == "" {
-        return fmt.Errorf("missing parameters for engine instance suspension: %+v", e)
-    }
+    if err := ec.validateEngineInstanceOp(e, "suspend"); err != nil {
+        return err
+    }

704-760: Consider breaking down the complex switchover logic.

While the implementation is correct, the method could be more maintainable if broken down into smaller, focused functions.

Consider refactoring into smaller functions:

+func (ec *EngineController) validateSwitchOverTarget(e *longhorn.Engine) error {
+    if !types.IsDataEngineV2(e.Spec.DataEngine) {
+        return fmt.Errorf("target switchover is not supported for data engine %v", e.Spec.DataEngine)
+    }
+    if e.Spec.VolumeName == "" || e.Spec.NodeID == "" {
+        return fmt.Errorf("missing parameters for target switchover: %+v", e)
+    }
+    return nil
+}

+func (ec *EngineController) getPortForSwitchOver(targetInstance *longhorn.InstanceProcess) int {
+    port := targetInstance.Status.TargetPortStart
+    if targetInstance.Status.StandbyTargetPortStart != 0 {
+        port = targetInstance.Status.StandbyTargetPortStart
+    }
+    return port
+}

 func (ec *EngineController) SwitchOverTarget(obj interface{}) error {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
         return fmt.Errorf("invalid object for target switchover: %v", obj)
     }
-    // ... existing validation code ...
+    if err := ec.validateSwitchOverTarget(e); err != nil {
+        return err
+    }
     // ... rest of the implementation ...
-    port := targetInstance.Status.TargetPortStart
-    if targetInstance.Status.StandbyTargetPortStart != 0 {
-        port = targetInstance.Status.StandbyTargetPortStart
-    }
+    port := ec.getPortForSwitchOver(targetInstance)

786-823: LGTM! Consider reusing validation logic.

The DeleteTarget implementation is solid with proper validation and error handling.

Consider reusing the previously suggested validation helper:

 func (ec *EngineController) DeleteTarget(obj interface{}) error {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
         return fmt.Errorf("invalid object for engine target deletion: %v", obj)
     }
-    if !types.IsDataEngineV2(e.Spec.DataEngine) {
-        return fmt.Errorf("deleting target for engine instance is not supported for data engine %v", e.Spec.DataEngine)
-    }
+    if err := ec.validateEngineInstanceOp(e, "delete target"); err != nil {
+        return err
+    }
webhook/resources/nodedataengineupgrade/validator.go (2)

50-53: Consider supporting future data engine types or providing clearer guidance

Currently, the validator only supports longhorn.DataEngineTypeV2. If future data engine types are introduced, this hard-coded check may become a maintenance burden. Consider revising the validation to accommodate extensibility or provide clearer error messages.
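
A small sketch of an extensible check, assuming a package-level set of supported engines that today contains only DataEngineTypeV2 (fmt is already imported by the validator):

var supportedUpgradeDataEngines = map[longhorn.DataEngineType]struct{}{
    longhorn.DataEngineTypeV2: {},
}

func validateDataEngine(dataEngine longhorn.DataEngineType) error {
    if _, ok := supportedUpgradeDataEngines[dataEngine]; !ok {
        return fmt.Errorf("data engine %v is not supported for node data engine upgrades", dataEngine)
    }
    return nil
}

Adding a future engine type then only requires extending the set rather than editing every comparison.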


78-92: Consolidate immutable field checks to reduce code duplication

The multiple if-statements checking for immutability of fields can be consolidated into a loop or helper function to improve readability and maintainability.

Apply this diff to refactor the immutability checks:

+immutableFields := map[string]string{
+    "spec.nodeID":                   oldNodeUpgrade.Spec.NodeID,
+    "spec.dataEngine":               string(oldNodeUpgrade.Spec.DataEngine),
+    "spec.instanceManagerImage":     oldNodeUpgrade.Spec.InstanceManagerImage,
+    "spec.dataEngineUpgradeManager": oldNodeUpgrade.Spec.DataEngineUpgradeManager,
+}
+
+for fieldPath, oldValue := range immutableFields {
+    newValue := getFieldValue(newNodeUpgrade, fieldPath)
+    if oldValue != newValue {
+        return werror.NewInvalidError(fmt.Sprintf("%s field is immutable", fieldPath), fieldPath)
+    }
+}

You'll need to implement the getFieldValue function to retrieve the field value based on the fieldPath.

controller/upgrade_manager_controller.go (1)

57-60: Address the TODO: Remove the wrapper when clients have moved to use the clientset

The TODO comment indicates an action item to remove the wrapper once all clients have migrated to use the clientset. Please ensure this task is tracked and addressed to keep the codebase clean.

Would you like assistance in updating the code or opening a new GitHub issue to track this task?

controller/monitor/node_upgrade_monitor.go (1)

98-98: Name the parameter in the UpdateConfiguration method

The UpdateConfiguration method has an unnamed parameter of type map[string]interface{}. Providing a name enhances code readability and adheres to Go best practices.

Update the function signature to include a parameter name:

- func (m *NodeDataEngineUpgradeMonitor) UpdateConfiguration(map[string]interface{}) error {
+ func (m *NodeDataEngineUpgradeMonitor) UpdateConfiguration(config map[string]interface{}) error {
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 743a76d and 00e7be5.

📒 Files selected for processing (50)
  • controller/backup_controller.go (1 hunks)
  • controller/controller_manager.go (2 hunks)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
  • controller/monitor/upgrade_manager_monitor.go (1 hunks)
  • controller/node_controller.go (2 hunks)
  • controller/node_upgrade_controller.go (1 hunks)
  • controller/replica_controller.go (5 hunks)
  • controller/uninstall_controller.go (4 hunks)
  • controller/upgrade_manager_controller.go (1 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
  • datastore/datastore.go (3 hunks)
  • datastore/longhorn.go (6 hunks)
  • engineapi/instance_manager.go (5 hunks)
  • engineapi/instance_manager_test.go (1 hunks)
  • k8s/crds.yaml (84 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (6 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/node.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/register.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/volume.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (4 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go (3 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/generic.go (2 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go (4 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go (2 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • scheduler/replica_scheduler.go (1 hunks)
  • types/types.go (4 hunks)
  • webhook/resources/dataengineupgrademanager/mutator.go (1 hunks)
  • webhook/resources/dataengineupgrademanager/validator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/mutator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/validator.go (1 hunks)
  • webhook/resources/volume/validator.go (5 hunks)
  • webhook/server/mutation.go (2 hunks)
  • webhook/server/validation.go (2 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
✅ Files skipped from review due to trivial changes (3)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go
  • k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go
🚧 Files skipped from review as they are similar to previous changes (23)
  • controller/instance_handler_test.go
  • controller/monitor/upgrade_manager_monitor.go
  • controller/uninstall_controller.go
  • engineapi/instance_manager_test.go
  • k8s/pkg/apis/longhorn/v1beta2/node.go
  • k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go
  • k8s/pkg/apis/longhorn/v1beta2/register.go
  • k8s/pkg/apis/longhorn/v1beta2/volume.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go
  • scheduler/replica_scheduler.go
  • webhook/resources/dataengineupgrademanager/mutator.go
  • webhook/resources/dataengineupgrademanager/validator.go
  • webhook/resources/nodedataengineupgrade/mutator.go
  • webhook/resources/volume/validator.go
  • webhook/server/mutation.go
  • webhook/server/validation.go
🧰 Additional context used
📓 Learnings (2)
controller/engine_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/engine_controller.go:524-527
Timestamp: 2024-11-25T12:39:58.926Z
Learning: In `controller/engine_controller.go`, `e.Status.Port` is sourced from the SPDK engine and does not require additional validation.
controller/monitor/node_upgrade_monitor.go (1)
Learnt from: james-munson
PR: longhorn/longhorn-manager#3211
File: app/post_upgrade.go:102-113
Timestamp: 2024-11-10T16:45:04.898Z
Learning: In Go, when a deferred function references a variable like `err`, ensure that the variable is declared in the outer scope and not within an inner scope (such as within `if err := ...`), to prevent compilation errors and unintended variable shadowing.
🪛 golangci-lint (1.61.0)
controller/volume_controller.go

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (66)
k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (2)

59-66: LGTM!

The list type implementation follows Kubernetes API conventions correctly.


1-66: Verify controller integration and usage patterns

Let's verify how this new API type is used in the codebase:

✅ Verification successful

Integration is properly implemented and verified

Based on the code analysis, the DataEngineUpgradeManager API type is well-integrated with necessary components:

  • Controller implementation exists with proper structure and dependencies
  • Monitor implementation is in place with sync mechanisms
  • Webhook validator is implemented with create/update validations
  • All key files are present and contain appropriate implementations:
    • controller/upgrade_manager_controller.go
    • controller/monitor/upgrade_manager_monitor.go
    • webhook/resources/dataengineupgrademanager/validator.go

The integration includes proper validation rules, monitoring mechanisms, and controller reconciliation structure, following the standard Kubernetes controller pattern.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check controller implementations and usage patterns

# Check for controller implementations
echo "Checking for controller implementations..."
rg "DataEngineUpgradeManager" -t go -l

# Check for usage in reconciliation loops
echo "Checking for reconciliation patterns..."
ast-grep --pattern 'func \($_\) Reconcile\($_\) {
  $$$
  DataEngineUpgradeManager
  $$$
}'

# Check for webhook implementations
echo "Checking for webhook validations..."
rg "DataEngineUpgradeManager.*Validate" -t go

Length of output: 1897


Script:

#!/bin/bash
# Let's check the controller implementations and reconciliation patterns more thoroughly

# Check upgrade manager controller implementation
echo "=== Checking upgrade manager controller implementation ==="
rg -A 10 "type UpgradeManagerController struct" controller/upgrade_manager_controller.go

# Check reconciliation loop in the controller
echo -e "\n=== Checking reconciliation implementation ==="
rg -A 20 "func.*Reconcile.*UpgradeManager" controller/upgrade_manager_controller.go

# Check monitor implementation
echo -e "\n=== Checking monitor implementation ==="
rg -A 10 "type UpgradeManagerMonitor struct" controller/monitor/upgrade_manager_monitor.go

# Check webhook validations
echo -e "\n=== Checking webhook validations ==="
rg -A 10 "func.*Validate" webhook/resources/dataengineupgrademanager/validator.go

Length of output: 773


Script:

#!/bin/bash
# Let's check for the existence and content of these files first

echo "=== Checking existence of key files ==="
for file in \
    "controller/upgrade_manager_controller.go" \
    "controller/monitor/upgrade_manager_monitor.go" \
    "webhook/resources/dataengineupgrademanager/validator.go"
do
    if [ -f "$file" ]; then
        echo "Found: $file"
    else
        echo "Missing: $file"
    fi
done

# Let's check for any files containing upgrade manager related code
echo -e "\n=== Searching for upgrade manager related code ==="
rg -g '!*.generated.go' -g '!*fake*.go' -g '!*test*.go' -g '!*zz_generated*' \
   -l "DataEngineUpgradeManager|UpgradeManager" --type go

# Let's check imports of the new type
echo -e "\n=== Checking imports of DataEngineUpgradeManager ==="
rg "import.*longhorn.*v1beta2" -A 5 -g '!*.generated.go' -g '!*fake*.go' -g '!*test*.go' -g '!*zz_generated*' --type go

Length of output: 2822


Script:

#!/bin/bash
# Let's examine the content of key files to understand the integration

echo "=== Checking controller manager registration ==="
rg -A 10 "NewUpgradeManagerController|RegisterController.*UpgradeManager" controller/controller_manager.go

echo -e "\n=== Checking upgrade manager controller structure and initialization ==="
head -n 50 controller/upgrade_manager_controller.go

echo -e "\n=== Checking monitor initialization and structure ==="
head -n 50 controller/monitor/upgrade_manager_monitor.go

echo -e "\n=== Checking webhook validator ==="
head -n 50 webhook/resources/dataengineupgrademanager/validator.go

Length of output: 4981
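
To make the "standard Kubernetes controller pattern" referenced above concrete, here is a minimal, self-contained sketch of the workqueue-driven reconciliation loop such a controller typically follows. The type name upgradeManagerController and the reconcile body are illustrative assumptions, not Longhorn's actual implementation; only the queue/worker wiring mirrors the pattern being verified.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// upgradeManagerController is a stand-in for the real controller type;
// informer event handlers and the datastore are omitted.
type upgradeManagerController struct {
	queue workqueue.RateLimitingInterface
}

func newUpgradeManagerController() *upgradeManagerController {
	return &upgradeManagerController{
		queue: workqueue.NewNamedRateLimitingQueue(
			workqueue.DefaultControllerRateLimiter(), "data-engine-upgrade-manager"),
	}
}

// Run drains the workqueue with the given number of workers until stopCh closes.
func (c *upgradeManagerController) Run(workers int, stopCh <-chan struct{}) {
	defer c.queue.ShutDown()
	for i := 0; i < workers; i++ {
		go func() {
			for c.processNextItem() {
			}
		}()
	}
	<-stopCh
}

func (c *upgradeManagerController) processNextItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	if err := c.reconcile(key.(string)); err != nil {
		// Re-queue with backoff on failure; real controllers cap the retries.
		c.queue.AddRateLimited(key)
		return true
	}
	c.queue.Forget(key)
	return true
}

func (c *upgradeManagerController) reconcile(name string) error {
	// Placeholder for the real work: fetch the DataEngineUpgradeManager,
	// check ownership, start/poll the monitor, and update status.
	fmt.Printf("reconciling %s at %s\n", name, time.Now().Format(time.RFC3339))
	return nil
}

func main() {
	c := newUpgradeManagerController()
	stopCh := make(chan struct{})
	go c.Run(1, stopCh)
	c.queue.Add("longhorn-system/upgrade-manager-demo")
	time.Sleep(200 * time.Millisecond)
	close(stopCh)
}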

k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (4)

1-26: LGTM! File header and imports are well-structured.

The file follows Kubernetes code generation conventions with proper license header and necessary imports.


55-58: LGTM! Namespace-specific methods are properly implemented.

The implementation correctly handles namespace scoping for the DataEngineUpgradeManager resource.


60-94: LGTM! Namespace lister implementation is robust.

The implementation correctly handles listing, retrieving, and error cases for namespace-scoped resources.

Let's verify error handling consistency across the codebase:

✅ Verification successful

Based on the search results, I can see that:

  1. The error handling pattern using errors.NewNotFound(v1beta2.Resource("dataengineupgrademanager"), name) is consistently used across all listers in the codebase, both in the Longhorn project and in the standard Kubernetes libraries.

  2. The Resource() function implementation is consistent across all packages, including the Longhorn v1beta2 package, where it correctly returns a schema.GroupResource by using SchemeGroupVersion.WithResource(resource).GroupResource().

Therefore, I can generate the final response:

Error handling in DataEngineUpgradeManagerNamespaceLister is consistent with Kubernetes patterns

The implementation follows the standard Kubernetes error handling pattern for "not found" cases, using the correct resource type and group resolution through the v1beta2.Resource() function.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent error handling for NotFound cases
# across similar listers in the codebase

# Search for similar error handling patterns
rg "errors\.NewNotFound.*Resource\(" --type go

# Search for Resource() method implementation
ast-grep --pattern 'func Resource($$$) schema.GroupResource {
  $$$
}'

Length of output: 28492


28-53: LGTM! Interface and implementation follow Kubernetes patterns.

The DataEngineUpgradeManagerLister implementation correctly handles listing and filtering of resources.

Let's verify the DataEngineUpgradeManager CRD definition:

✅ Verification successful

Let me verify the CRD definition with another script since the first search didn't find it.


Let me check the CRD definition in crds.yaml to verify it matches the Go type.


Let me try one more search with a different pattern to find the CRD definition.


DataEngineUpgradeManager resource type and lister implementation verified

The implementation is correct and consistent with the resource definition:

  • CRD is properly defined in k8s/crds.yaml with matching type name and scope
  • Go type is defined in k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go
  • Resource is correctly registered in the informer factory
  • Lister implementation follows standard Kubernetes patterns for namespaced resources
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify DataEngineUpgradeManager CRD definition exists
# and matches the resource type used in the lister

# Search for CRD definition
rg -l "kind: CustomResourceDefinition.*dataengineupgrademanagers"

# Search for type definition
ast-grep --pattern 'type DataEngineUpgradeManager struct {
  $$$
}'

Length of output: 817


Script:

#!/bin/bash
# Search for CRD definition in yaml files
fd -e yaml -e yml -x rg -l "kind:\s+DataEngineUpgradeManager"

# Also check if there's a CRD registration in Go code
rg -l "SchemeGroupVersion.*dataengineupgrademanagers"

Length of output: 188


Script:

#!/bin/bash
# Extract and examine the DataEngineUpgradeManager CRD definition
rg -A 20 "kind: CustomResourceDefinition.*dataengineupgrademanagers" k8s/crds.yaml

Length of output: 84


Script:

#!/bin/bash
# Search for DataEngineUpgradeManager in CRD file with a more lenient pattern
rg -A 10 "DataEngineUpgradeManager" k8s/crds.yaml

# Also check if the type is registered in the scheme
rg -A 5 "DataEngineUpgradeManager" k8s/pkg/client/informers/externalversions/generic.go

Length of output: 2603

k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (2)

61-67: LGTM: New instance state for upgrade support

The addition of InstanceStateSuspended state is well-placed and follows existing conventions. This state is essential for managing instances during live upgrades.

Let's verify the state handling in the controllers:

✅ Verification successful

The suspended state is properly integrated across the codebase

The verification shows that InstanceStateSuspended is well-integrated and handled appropriately in key controller components:

  • Used in instance handler for state management and transitions
  • Integrated in volume controller for orchestrating instance states
  • Handled in engine controller for managing engine state transitions during upgrades
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for suspended state handling in controllers
rg "InstanceStateSuspended" --type go

Length of output: 1144
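
Following up on the InstanceStateSuspended addition discussed above, the sketch below shows how a suspended state typically slots into a string-typed state enum and how a controller-side transition guard might use it. The constant values and the guard are assumptions for illustration, not the exact Longhorn definitions.

package main

import "fmt"

// InstanceState mirrors the string-typed state enum discussed above;
// the concrete values are assumptions for illustration.
type InstanceState string

const (
	InstanceStateRunning   InstanceState = "running"
	InstanceStateStopped   InstanceState = "stopped"
	InstanceStateError     InstanceState = "error"
	InstanceStateSuspended InstanceState = "suspended" // new state used during live upgrade
)

// canSuspend sketches the kind of guard a controller would apply before
// suspending an instance for a v2 data engine live upgrade.
func canSuspend(state InstanceState) error {
	if state != InstanceStateRunning {
		return fmt.Errorf("cannot suspend instance in state %q; only running instances can be suspended", state)
	}
	return nil
}

func main() {
	fmt.Println(canSuspend(InstanceStateRunning))   // <nil>
	fmt.Println(canSuspend(InstanceStateSuspended)) // error
}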


154-157: LGTM: Port range allocation for standby instances

The addition of standby port ranges follows the existing pattern and supports the preparation of standby instances during live upgrades.

Let's verify port range allocation logic:

✅ Verification successful

Port range fields are properly integrated and used in the codebase

The verification shows that the standby port range fields are:

  • Properly defined in the API types and CRDs
  • Used during engine switchover operations to manage standby target instances
  • Integrated into instance management logic for handling target instance replacements
  • Consistently handled in RPC/protobuf definitions and instance status conversions

Key usage examples found:

  • Engine controller uses these ports during switchover operations
  • Instance handler checks these fields to determine instance states
  • Fields are properly propagated through the instance manager API
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for standby port allocation implementation
ast-grep --pattern 'StandbyTargetPort$$$' 

Length of output: 43


Script:

#!/bin/bash
# Search for any references to standby port fields using ripgrep
rg -i "standbyTargetPort" -A 5 -B 5

Length of output: 17147
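
To illustrate how the standby port fields discussed above come into play during switchover, here is a minimal sketch. The field names (TargetPortStart, StandbyTargetPortStart) are taken from the review discussion and the selection rule is an assumption that mirrors the behavior described there, not the literal engine controller code.

package main

import "fmt"

// instanceProcessStatus carries only the port fields relevant to the sketch.
type instanceProcessStatus struct {
	TargetPortStart        int
	StandbyTargetPortStart int
}

// pickSwitchoverPort returns the port the initiator should reconnect to:
// prefer the standby target port when one has been allocated, otherwise
// fall back to the regular target port.
func pickSwitchoverPort(s instanceProcessStatus) int {
	if s.StandbyTargetPortStart != 0 {
		return s.StandbyTargetPortStart
	}
	return s.TargetPortStart
}

func main() {
	fmt.Println(pickSwitchoverPort(instanceProcessStatus{TargetPortStart: 20001}))                                // 20001
	fmt.Println(pickSwitchoverPort(instanceProcessStatus{TargetPortStart: 20001, StandbyTargetPortStart: 20010})) // 20010
}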

k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go (3)

77-83: LGTM: DataEngineUpgradeManager interfaces follow the established pattern.

The new DataEngineUpgradeManagerListerExpansion and DataEngineUpgradeManagerNamespaceListerExpansion interfaces are correctly defined following the existing pattern in the file. These interfaces support the v2 data engine live upgrade feature by providing extension points for custom methods.


117-123: LGTM: NodeDataEngineUpgrade interfaces follow the established pattern.

The new NodeDataEngineUpgradeListerExpansion and NodeDataEngineUpgradeNamespaceListerExpansion interfaces are correctly defined following the existing pattern in the file. These interfaces complement the data engine upgrade feature by providing node-specific upgrade capabilities.


Line range hint 1-200: Verify the code generation process.

Since this is a generated file, we should verify that it was generated using the correct version of lister-gen and that the generation was triggered by appropriate changes in the source types.

✅ Verification successful

Generated code is consistent with source types and generation markers

The verification confirms that:

  1. Both DataEngineUpgradeManager and NodeDataEngineUpgrade types have proper generation markers:
    • +genclient for client generation
    • +k8s:deepcopy-gen for deep copy generation
  2. These types are part of a properly marked API package (doc.go has +k8s:deepcopy-gen=package)
  3. The generated listers are consistent with all types marked for generation in the v1beta2 package
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the code generation setup and source types

# Check for code generation markers in source files
echo "Checking for code generation markers in API types..."
rg --type go "+k8s:deepcopy-gen" k8s/pkg/apis/longhorn/v1beta2/
rg --type go "+genclient" k8s/pkg/apis/longhorn/v1beta2/

# Look for the type definitions that triggered this generation
echo "Looking for DataEngineUpgradeManager and NodeDataEngineUpgrade type definitions..."
rg --type go "type (DataEngineUpgradeManager|NodeDataEngineUpgrade) struct" k8s/pkg/apis/longhorn/v1beta2/

Length of output: 886


Script:

#!/bin/bash
# Description: Verify the code generation setup and source types with escaped patterns

# Check for code generation markers in API types
echo "Checking for code generation markers in API types..."
rg --type go "\+k8s:deepcopy-gen" k8s/pkg/apis/longhorn/v1beta2/
rg --type go "\+genclient" k8s/pkg/apis/longhorn/v1beta2/

# Check the generation tags in the specific files
echo "Checking generation tags in the specific type files..."
cat k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go | grep -A 1 "// .*gen"
cat k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go | grep -A 1 "// .*gen"

# Check if there's a generation script or Makefile target
echo "Looking for code generation scripts..."
find . -type f -name Makefile -exec grep -l "lister-gen\|client-gen\|deepcopy-gen" {} \;

Length of output: 8252

k8s/pkg/client/informers/externalversions/generic.go (2)

103-104: LGTM: DataEngineUpgradeManager informer integration looks correct.

The addition of the dataengineupgrademanagers resource case follows the established pattern and correctly integrates with the v1beta2 API version.


113-114: LGTM: NodeDataEngineUpgrade informer integration looks correct.

The addition of the nodedataengineupgrades resource case follows the established pattern and correctly integrates with the v1beta2 API version.

controller/node_upgrade_controller.go (5)

135-139: Previous review comment about undefined maxRetries is still valid


205-208: Previous review comment about finalizer removal error handling is still valid


223-229: Previous review comment about status update error handling is still valid


252-257: Previous review comment about nil data check is still valid


275-283: Previous review comment about variable shadowing is still valid

controller/controller_manager.go (2)

157-165: LGTM: Controller initialization follows established patterns

The initialization of both new controllers (dataEngineUpgradeManagerController and nodeDataEngineUpgradeController) follows the consistent pattern used throughout the codebase, with proper error handling and parameter passing.


226-227: Verify controller startup dependencies

While the startup code follows the established pattern, please verify if there are any dependencies between these upgrade controllers and other controllers that might require a specific startup order. For example:

  • Does the upgrade process depend on the state of volumes, engines, or nodes?
  • Could race conditions occur if these controllers start before their dependencies?

Let's check for potential dependencies in the controller implementations:

✅ Verification successful

Let me check the sync implementations of these upgrade controllers to understand their dependencies.


Let me try a different pattern to find the controller implementations.


Let me check the specific dependencies and initialization order in these controllers.


Based on the analysis of the initialization code and controller startup sequence, I can now provide a final response:

Controller startup order is appropriate

The startup order is safe because:

  1. Both controllers are initialized with the required dependencies (datastore, scheme, kubeClient) before startup
  2. They are started after core volume controllers but before Kubernetes controllers, which is the correct order since:
    • Volume controllers need to be ready first as they manage the core resources
    • Upgrade controllers operate independently of Kubernetes controllers
  3. Both upgrade controllers are started together, which is appropriate as they handle different aspects of the upgrade process:
    • DataEngineUpgradeManager handles the overall upgrade strategy
    • NodeDataEngineUpgrade handles node-specific upgrade operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for dependencies in the upgrade controllers

# Look for references to other controllers or their resources
echo "Checking DataEngineUpgradeManager controller dependencies..."
rg -A 5 "type DataEngineUpgradeManagerController struct"

echo "Checking NodeDataEngineUpgrade controller dependencies..."
rg -A 5 "type NodeDataEngineUpgradeController struct"

# Look for resource dependencies in the sync methods
echo "Checking sync methods for resource dependencies..."
rg -B 5 -A 10 "func \(.*\) sync\("

Length of output: 34641


Script:

#!/bin/bash
# Check sync implementations of the upgrade controllers

echo "Checking DataEngineUpgradeManager controller sync..."
rg -A 20 "func \(c \*DataEngineUpgradeManagerController\) sync\(" controller/

echo "Checking NodeDataEngineUpgrade controller sync..."
rg -A 20 "func \(c \*NodeDataEngineUpgradeController\) sync\(" controller/

Length of output: 380


Script:

#!/bin/bash
# Search for upgrade controller files and their contents

echo "Searching for upgrade controller files..."
fd -t f "upgrade.*controller.go" controller/

echo "Checking upgrade controller implementations..."
for file in $(fd -t f "upgrade.*controller.go" controller/); do
  echo "=== Contents of $file ==="
  cat "$file" | grep -A 10 "func.*sync"
done

echo "Checking for volume/engine dependencies..."
rg -l "volume|engine" $(fd -t f "upgrade.*controller.go" controller/)

Length of output: 1880


Script:

#!/bin/bash
# Check controller initialization and dependencies

echo "=== Checking DataEngineUpgradeManager controller initialization ==="
rg -B 5 -A 10 "NewDataEngineUpgradeManagerController" controller/

echo "=== Checking NodeDataEngineUpgrade controller initialization ==="
rg -B 5 -A 10 "NewNodeDataEngineUpgradeController" controller/

echo "=== Checking controller manager initialization order ==="
rg -B 10 -A 10 "dataEngineUpgradeManagerController.*Run|nodeDataEngineUpgradeController.*Run" controller/

Length of output: 6828

datastore/datastore.go (2)

186-189: LGTM! Proper initialization of informers

The initialization of the new informers follows the established pattern and correctly adds cache syncs for both new resources.


279-282: LGTM! Consistent initialization of DataStore fields

The initialization of the new fields follows the existing pattern and maintains consistency with the rest of the codebase.

engineapi/instance_manager.go (4)

283-292: LGTM: Instance process status fields properly extended

The addition of standby target port fields enhances the instance process status tracking capabilities, which is essential for the v2 data engine live upgrade feature.


532-555: LGTM: Well-implemented replica address filtering

The function correctly handles:

  • Proper parsing and validation of network addresses
  • Filtering of replicas based on initiator/target comparison
  • Error handling for invalid address formats

864-879: ⚠️ Potential issue

Fix typo in error message

The error message contains a typo: "date engine" should be "data engine".

Apply this fix:

-		return fmt.Errorf("engine suspension for date engine %v is not supported yet", longhorn.DataEngineTypeV1)
+		return fmt.Errorf("engine suspension for data engine %v is not supported yet", longhorn.DataEngineTypeV1)

Likely invalid or redundant comment.


885-885: ⚠️ Potential issue

Fix incorrect function comment

The comment incorrectly states that the function suspends the engine instance when it actually resumes it.

-// EngineInstanceResume suspends engine instance
+// EngineInstanceResume resumes engine instance

Likely invalid or redundant comment.

controller/backup_controller.go (1)

599-607: LGTM! The data engine upgrade check is well implemented.

The added check prevents nodes undergoing data engine upgrades from taking responsibility for new backups, which is crucial for maintaining system stability during upgrades.

Let's verify the usage of this check across the codebase:

✅ Verification successful

The data engine upgrade check is correctly implemented and consistently used across the codebase

The verification shows that DataEngineUpgradeRequested is properly handled:

  • Used in volume validation (webhook)
  • Checked in critical controllers (volume, node, engine, backup)
  • Managed by the node upgrade monitor
  • Has a dedicated helper in the datastore

The implementation in backup controller aligns with the system-wide approach to prevent operations during data engine upgrades.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other controllers that might need similar data engine upgrade checks
# to prevent race conditions during upgrades.

# Search for other isResponsibleFor methods
ast-grep --pattern 'func $$$isResponsibleFor$$$'

# Search for DataEngineUpgradeRequested usage
rg "DataEngineUpgradeRequested"

Length of output: 2467

controller/instance_handler.go (4)

38-46: LGTM: Interface changes are well-structured

The new methods added to InstanceManagerHandler interface (SuspendInstance, ResumeInstance, SwitchOverTarget, DeleteTarget, RequireRemoteTargetInstance) provide a clean API for v2 data engine live upgrade operations.


224-227: Resolve error handling uncertainty

The TODO comment about error handling when checking node upgrade status should be addressed.


113-113: ⚠️ Potential issue

Fix incorrect log message

The log message incorrectly mentions "Target IP" instead of "Target Port".

-			logrus.Infof("Instance %v starts running, Target IP %v", instanceName, status.TargetIP)
+			logrus.Infof("Instance %v starts running, Target Port %v", instanceName, status.TargetPort)

Likely invalid or redundant comment.


602-618: Verify error handling in suspend-resume flow

The suspend-resume flow needs careful error handling verification:

  1. If suspend fails, the instance remains in its current state
  2. If switch-over fails, resume is attempted
  3. If resume fails after switch-over failure, the instance could be left in an inconsistent state
✅ Verification successful

Let me gather more information about the error handling patterns in the suspend-resume flow.


Let me try one more search pattern to find the implementations and error handling.


Based on the code analysis, I can now provide a final response about the error handling in the suspend-resume flow.

Error handling in suspend-resume flow is properly implemented

The error handling implementation follows a robust pattern:

  1. If suspend fails, the error is returned immediately without any state change, keeping the instance in its original state
  2. If switch-over fails after successful suspension:
    • Resume is attempted as a recovery mechanism
    • If resume fails, it's logged but the original switch-over error is returned
    • This maintains error transparency while attempting recovery

The implementation properly handles all error cases and maintains consistent state transitions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential error handling issues in suspend-resume flows

# Search for similar error handling patterns in the codebase
rg -A 5 "SuspendInstance.*ResumeInstance"

# Check for any error recovery mechanisms
ast-grep --pattern 'if err := $_.ResumeInstance($_); err != nil {
  $$$
}'

Length of output: 371


Script:

#!/bin/bash
# Search for suspend instance implementation and its error handling
ast-grep --pattern 'func $_.suspendInstance($$$) {
  $$$
}'

# Search for switch over target implementation
ast-grep --pattern 'func $_.switchOverTarget($$$) {
  $$$
}'

# Look for other places where suspend-resume flow is used
rg -B 2 -A 5 "suspendInstance.*switchOverTarget"

Length of output: 175


Script:

#!/bin/bash
# Search for suspend and resume related functions with broader patterns
rg -B 2 -A 10 "func.*suspend" 
rg -B 2 -A 10 "func.*resume"

# Look for instance state handling
rg -B 2 -A 5 "InstanceState.*Suspended"

# Search for instance manager related error handling
rg -B 2 -A 5 "instanceManagerHandler.*Instance.*error"

Length of output: 14658

types/types.go (2)

43-44: LGTM: Constants are well-defined and properly organized.

The new constants follow the established naming conventions and are appropriately grouped with related constants. They provide clear identifiers for the data engine upgrade feature components.

Also applies to: 163-164, 193-193


1271-1291: LGTM: Functions are well-implemented and follow established patterns.

The new functions:

  1. GenerateNodeDataEngineUpgradeName: Properly generates unique names with sufficient randomness
  2. GetDataEngineUpgradeManagerLabels: Correctly implements label generation for upgrade manager
  3. GetNodeDataEngineUpgradeLabels: Properly handles conditional labels for node upgrades

All functions follow the codebase's established patterns and best practices.
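
For readers unfamiliar with these helpers, the following is a hedged sketch of the shape such name- and label-generation functions usually take: a stable prefix plus a random suffix for uniqueness, and label maps whose optional keys are only added when their values are set. The exact prefixes, label keys, and random-suffix utility are assumptions, not the actual types.go code.

package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomSuffix stands in for the utility Longhorn uses for unique suffixes.
func randomSuffix(n int) string {
	b := make([]byte, n)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)[:n]
}

// generateNodeDataEngineUpgradeName sketches the naming scheme: upgrade
// manager name, node ID, and a random suffix. The format is an assumption.
func generateNodeDataEngineUpgradeName(upgradeManager, nodeID string) string {
	return fmt.Sprintf("%s-%s-%s", upgradeManager, nodeID, randomSuffix(8))
}

// nodeDataEngineUpgradeLabels sketches conditional label generation: keys are
// only added when the corresponding value is set. Label keys are placeholders.
func nodeDataEngineUpgradeLabels(upgradeManager, nodeID string) map[string]string {
	labels := map[string]string{"longhorn.io/component": "node-data-engine-upgrade"}
	if upgradeManager != "" {
		labels["longhorn.io/data-engine-upgrade-manager"] = upgradeManager
	}
	if nodeID != "" {
		labels["longhorn.io/node"] = nodeID
	}
	return labels
}

func main() {
	fmt.Println(generateNodeDataEngineUpgradeName("upgrade-mgr-1", "node-1"))
	fmt.Println(nodeDataEngineUpgradeLabels("upgrade-mgr-1", ""))
}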

controller/volume_controller_test.go (2)

Line range hint 1-24: Well-structured test setup with comprehensive coverage

The test file follows good testing practices:

  • Uses fake clients for isolation
  • Provides helper functions for test case generation
  • Covers various lifecycle scenarios including error cases

506-510: ⚠️ Potential issue

Critical: Engine status fields not properly defined in the codebase

The test is setting new engine status fields (TargetIP, StorageIP, StorageTargetIP, TargetPort) that are not properly defined in the EngineStatus struct according to the previous review. This will cause runtime errors.

Please ensure these fields are properly defined in:

  • k8s/pkg/apis/longhorn/v1beta2/engine.go
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go
controller/node_controller.go (1)

Line range hint 2177-2209: LGTM! Please verify integration with v2 data engine upgrade

The changes to SetSchedulableCondition look good:

  • Clean separation of conditions with clear variable assignments
  • Proper handling of both cordoned node and data engine upgrade cases
  • Appropriate condition reason and message for data engine upgrade

Let's verify the integration with v2 data engine upgrade:

✅ Verification successful

Integration with v2 data engine upgrade is properly implemented

The verification shows comprehensive integration across the codebase:

  • Webhook validation prevents volume attachment to nodes during upgrade
  • Engine controller handles upgrade state for instance management
  • Volume controller skips replica replenishment during upgrade
  • Backup controller respects upgrade state
  • Node upgrade monitor manages the upgrade lifecycle by setting/unsetting DataEngineUpgradeRequested
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the data engine upgrade integration

# Check if there are any other components that need to handle DataEngineUpgradeRequested
rg -A 5 "DataEngineUpgradeRequested"

# Check for any related v2 data engine upgrade handlers
ast-grep --pattern 'func $$$DataEngineUpgrade$$$'

Length of output: 7818
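
To summarize the scheduling decision described in the node controller review above, here is a minimal sketch of the combined cordoned/upgrade check. The stand-in types, condition reasons, and messages are placeholders chosen for illustration; the real code sets Longhorn node conditions and emits events.

package main

import "fmt"

// Minimal stand-ins for the node fields the review comment refers to.
type nodeSpec struct {
	DataEngineUpgradeRequested bool
}

type kubeNode struct {
	Unschedulable bool // true when the Kubernetes node is cordoned
}

// schedulableReason sketches the decision described above: scheduling is
// disabled either when the node is cordoned or when a data engine upgrade
// has been requested, each with its own reason and message.
func schedulableReason(kn kubeNode, spec nodeSpec) (schedulable bool, reason, message string) {
	switch {
	case kn.Unschedulable:
		return false, "KubernetesNodeCordoned", "node is cordoned"
	case spec.DataEngineUpgradeRequested:
		return false, "NodeDataEngineUpgradeRequested", "v2 data engine upgrade is in progress on this node"
	default:
		return true, "", ""
	}
}

func main() {
	fmt.Println(schedulableReason(kubeNode{}, nodeSpec{DataEngineUpgradeRequested: true}))
}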

controller/engine_controller.go (2)

634-641: Improve error handling for target deletion.

The code correctly handles target deletion for engines with CurrentTargetNodeID, but the error handling could be improved.

This issue was previously identified and resolved based on past review comments. The current implementation is correct.


524-527: Verify port handling during upgrades.

The code correctly handles port assignments for both initiator and target addresses. Based on the learnings, e.Status.Port is sourced from the SPDK engine and doesn't require validation.

Run the following script to verify port usage patterns in the codebase:

✅ Verification successful

Based on the search results, I can now generate the final response:

Port handling during upgrades is correctly implemented.

The code safely handles port assignments:

  • The initiator port is intentionally set to 0 as it's not needed for non-upgrade operations
  • The target port is sourced from e.Status.Port which is properly managed by the instance manager:
    • Set when instance starts running (status.Port = int(instance.Status.PortStart))
    • Reset to 0 when instance stops/fails
    • Validated before use (if e.Status.IP == "" || e.Status.Port == 0 { return nil, fmt.Errorf("require IP and Port") })
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent port handling patterns across the codebase

# Check for other occurrences of JoinHostPort with port 0
rg -A 2 "JoinHostPort.*,\s*0\s*\)"

# Check for SPDK engine port assignments
rg "Status\.Port\s*=.*SPDK"

Length of output: 66


Script:

#!/bin/bash
# Let's try a broader search for port handling patterns

# Check for JoinHostPort usage patterns
rg -A 2 "JoinHostPort"

# Check for port assignments and validations
rg -A 2 "\.Port\s*="

# Check for SPDK related port configurations
rg -A 2 "SPDK.*[Pp]ort"

# Look for port-related constants or configurations
rg -A 2 "DefaultPort|PORT"

Length of output: 110179

k8s/crds.yaml (3)

1313-1414: LGTM: Well-structured CRDs for data engine upgrade management

The new CRDs DataEngineUpgradeManager and NodeDataEngineUpgrade are well-designed with:

  • Clear separation of cluster-wide and node-level upgrade management
  • Comprehensive status tracking with conditions
  • Proper validation constraints
  • Informative printer columns for kubectl

Also applies to: 2364-2452


1372-1376: LGTM: Consistent data engine field additions

The dataEngine field has been consistently added across CRDs with:

  • Proper enum validation (v1, v2)
  • Clear deprecation notices for old fields
  • Consistent field placement and documentation

Also applies to: 2419-2420


1791-1792: LGTM: Proper target node support added

The targetNodeID field has been consistently added to relevant CRDs to support node targeting during upgrades:

  • Present in both spec and status sections for proper state tracking
  • Consistent field naming across CRDs

Also applies to: 3237-3238, 3280-3281

controller/volume_controller.go (5)

1007-1012: LGTM! Clean handling of replica image differences between v1 and v2 data engines

The code correctly handles the different image requirements between v1 and v2 data engines. For v2 volumes, replicas can have different images from the volume, which is properly validated here.


1827-1836: LGTM! Proper replica startup handling for v2 data engine

The code correctly handles replica startup states for both v1 and v2 data engines, with appropriate image validation.


1923-1930: LGTM! Proper engine state transition handling for v2 data engine

The code correctly manages engine state transitions during upgrades, with appropriate checks for image and target node alignment.


3817-3822: LGTM! Proper replica image handling during creation

The code correctly handles replica image assignment based on data engine type, with v2 replicas properly using the instance manager image.


5086-5094: LGTM! Comprehensive upgrade status check

The code properly determines if a volume is upgrading by checking both image updates and target node changes for v2 data engine.
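
A hedged sketch of that upgrade check is shown below; the field names follow the review discussion and are assumptions, not the exact volume controller code.

package main

import "fmt"

// Trimmed-down volume fields used by the sketch.
type volume struct {
	DataEngine          string
	SpecImage           string
	CurrentImage        string
	TargetNodeID        string
	CurrentTargetNodeID string
}

// isVolumeUpgrading mirrors the check described above: an image change marks
// an upgrade for both engines, and for the v2 data engine a pending target
// node change also counts as an upgrade in progress.
func isVolumeUpgrading(v volume) bool {
	imageNotUpdated := v.CurrentImage != v.SpecImage
	if v.DataEngine != "v2" {
		return imageNotUpdated
	}
	return imageNotUpdated || v.TargetNodeID != v.CurrentTargetNodeID
}

func main() {
	v := volume{DataEngine: "v2", SpecImage: "img:v2", CurrentImage: "img:v2",
		TargetNodeID: "node-2", CurrentTargetNodeID: "node-1"}
	fmt.Println(isVolumeUpgrading(v)) // true: target node switch still pending
}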

datastore/longhorn.go (4)

Line range hint 3742-3782: Improved instance manager lookup with remote node support

The changes to GetInstanceManagerByInstance properly handle remote node instances by using TargetNodeID instead of NodeID when isInstanceOnRemoteNode is true. This is essential for proper instance manager selection during operations like live migration.


3784-3841: Well-structured helper functions for instance manager operations

The new helper functions listInstanceManagers and filterInstanceManagers provide clean separation of concerns:

  • listInstanceManagers handles data engine specific logic
  • filterInstanceManagers handles ambiguous selection cases

5641-5875: Comprehensive implementation of DataEngineUpgrade CRD operations

The new methods for DataEngineUpgradeManager and NodeDataEngineUpgrade CRDs follow consistent patterns and best practices:

  • Proper verification of creation/updates
  • Finalizer handling
  • Status updates
  • List operations with RO variants

5998-6005: Clean helper method for checking node upgrade requests

The IsNodeDataEngineUpgradeRequested helper provides a simple way to check if a node has requested a data engine upgrade.
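
A minimal sketch of such a helper follows; the datastore lookup is replaced by a simple in-memory stand-in, so the types and method names are illustrative assumptions rather than the actual datastore API.

package main

import "fmt"

type node struct {
	Name                       string
	DataEngineUpgradeRequested bool
}

// fakeStore stands in for the datastore's read-only node lookup.
type fakeStore struct{ nodes map[string]node }

func (s fakeStore) getNodeRO(name string) (node, error) {
	n, ok := s.nodes[name]
	if !ok {
		return node{}, fmt.Errorf("node %q not found", name)
	}
	return n, nil
}

// isNodeDataEngineUpgradeRequested sketches the helper described above:
// fetch the node and report whether an upgrade was requested on it.
func isNodeDataEngineUpgradeRequested(s fakeStore, name string) (bool, error) {
	n, err := s.getNodeRO(name)
	if err != nil {
		return false, err
	}
	return n.DataEngineUpgradeRequested, nil
}

func main() {
	s := fakeStore{nodes: map[string]node{"node-1": {Name: "node-1", DataEngineUpgradeRequested: true}}}
	fmt.Println(isNodeDataEngineUpgradeRequested(s, "node-1"))
}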

webhook/resources/nodedataengineupgrade/validator.go (1)

94-95: LGTM!

The validator returns nil appropriately when all validations pass.

controller/upgrade_manager_controller.go (4)

135-139: Undefined variable maxRetries

The variable maxRetries is used but not defined in this file or imported packages, which will result in a compile-time error.


136-142: Undefined function handleReconcileErrorLogging

The function handleReconcileErrorLogging is called but not defined in this file or imported packages, leading to a compile-time error.


202-209: Ensure proper cleanup before removing the finalizer

When handling deletion, the controller removes the finalizer before confirming that all resources, such as the monitor, are fully cleaned up, which could lead to lingering resources.


263-275: Potential race condition accessing uc.dataEngineUpgradeManagerMonitor

Access to uc.dataEngineUpgradeManagerMonitor is not synchronized, which could lead to race conditions if the controller is running multiple workers.

controller/replica_controller.go (3)

Line range hint 636-672: Ensure all calls to GetInstance are updated with the new parameter

The method GetInstance now includes an additional parameter isInstanceOnRemoteNode. Please verify that all calls to this method in the codebase have been updated to include this parameter with the correct value.

Run the following script to find all calls to GetInstance and check their parameters:

✅ Verification successful

Based on the rg search results, I can generate the final response as we have found all the occurrences of GetInstance in the codebase.

All calls to GetInstance are properly updated with the new parameter

The codebase search shows that all implementations and calls to GetInstance have been correctly updated to include the isInstanceOnRemoteNode parameter:

  • Interface definition in instance_handler.go:38
  • Implementation in replica_controller.go:636
  • Implementation in engine_controller.go:839
  • Mock implementation in instance_handler_test.go:39
  • All calls include the boolean parameter with appropriate values:
    • false for local node operations
    • true for remote node operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all calls to `GetInstance` and verify they use the new parameter.
# Expectation: All calls should pass the `isInstanceOnRemoteNode` argument.

rg --type go 'GetInstance\(' -n

Length of output: 1349


Line range hint 316-355: Ensure all calls to CreateInstance are updated with the new parameter

The method CreateInstance now includes an additional parameter isInstanceOnRemoteNode. Please verify that all calls to this method throughout the codebase have been updated to pass the appropriate value for isInstanceOnRemoteNode to ensure correct functionality.

Run the following script to find all usages of CreateInstance and check their parameters:

✅ Verification successful

Let me analyze the results. The search shows all occurrences of CreateInstance in the codebase. I can see that:

  1. The matches in vendor/github.com/go-ole/go-ole/ are unrelated as they belong to a different package and have different signatures.
  2. The relevant matches show:
    • Interface definition in controller/instance_handler.go:39
    • Mock implementation in controller/instance_handler_test.go:51
    • Actual implementations in:
      • controller/replica_controller.go:316
      • controller/engine_controller.go:469
    • Usage in controller/instance_handler.go with proper boolean arguments:
      • Line 750: CreateInstance(obj, false)
      • Line 763: CreateInstance(obj, true)
      • Line 783: CreateInstance(obj, false)

All calls to CreateInstance are properly updated with the new parameter

All implementations and calls to CreateInstance in the codebase have been correctly updated to include the isInstanceOnRemoteNode parameter. The interface definition, mock implementation, actual implementations, and all usage sites pass the appropriate boolean value.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all calls to `CreateInstance` and verify they use the new parameter.
# Expectation: All calls should pass the `isInstanceOnRemoteNode` argument.

rg --type go 'CreateInstance\(' -n

Length of output: 1461


Line range hint 528-537: Potential logical error: Passing false for isInstanceOnRemoteNode in DeleteInstance

In the DeleteInstance method, the call to GetInstanceManagerByInstance uses false as the isInstanceOnRemoteNode parameter. If the replica instance might be on a remote node, passing false could lead to incorrect instance manager retrieval and potential errors. Please verify whether the correct value should be determined based on the replica's location.

Run the following script to check usage of GetInstanceManagerByInstance and how isInstanceOnRemoteNode is set:

controller/monitor/node_upgrade_monitor.go (2)

192-206: The issue regarding variable shadowing of err within the deferred function is still present. Please refer to the previous review comment for details on how to address this.


293-293: Deferring engineClientProxy.Close() inside a loop can lead to resource exhaustion due to postponed closures. The concern raised in the previous review comment remains applicable.
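
For reference, the usual fix for the defer-in-loop concern is to scope the deferred Close to a per-iteration function. The sketch below uses stand-in types rather than the real engine client proxy, so it illustrates the pattern only.

package main

import "fmt"

type engineClientProxy struct{ name string }

func (p *engineClientProxy) Close() { fmt.Printf("closed proxy for %s\n", p.name) }

func newProxy(name string) *engineClientProxy { return &engineClientProxy{name: name} }

// processVolumes wraps each iteration in a function so the deferred Close runs
// at the end of that iteration instead of piling up until the outer function returns.
func processVolumes(volumes []string) {
	for _, v := range volumes {
		func() {
			proxy := newProxy(v)
			defer proxy.Close() // released at the end of this iteration
			fmt.Printf("working on %s\n", v)
		}()
	}
}

func main() { processVolumes([]string{"vol-1", "vol-2"}) }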

k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (6)

981-1007: DeepCopy Functions for DataEngineUpgradeManager Added Successfully

The autogenerated DeepCopy methods for the DataEngineUpgradeManager type and its related functions appear correctly implemented. These functions will ensure proper deep copying of the custom resource objects.


1010-1041: DeepCopy Functions for DataEngineUpgradeManagerList Implemented Correctly

The added deepcopy methods for DataEngineUpgradeManagerList are correctly generated. This ensures that lists of DataEngineUpgradeManager objects can be deep copied without issues.


1064-1092: DeepCopy Methods for DataEngineUpgradeManagerSpec and DataEngineUpgradeManagerStatus

The DeepCopyInto and DeepCopy functions for both DataEngineUpgradeManagerSpec and DataEngineUpgradeManagerStatus are properly defined, including deep copying of slices and maps. This maintains data integrity during object copying.


1783-1842: DeepCopy Functions for NodeDataEngineUpgrade and Related Types

The deepcopy implementations for NodeDataEngineUpgrade, NodeDataEngineUpgradeList, NodeDataEngineUpgradeSpec, and NodeDataEngineUpgradeStatus are correctly generated. They handle the nested structures appropriately.


2991-3006: DeepCopy Methods for UpgradeNodeStatus

The DeepCopyInto and DeepCopy methods for UpgradeNodeStatus are correctly implemented, ensuring accurate copying of these status objects.


3326-3341: DeepCopy Functions for VolumeUpgradeStatus

The added deepcopy functions for VolumeUpgradeStatus are appropriate and correctly handle the copying process.

Comment on lines 235 to 255
	status, ok := data.(*longhorn.DataEngineUpgradeManagerStatus)
	if !ok {
		log.Errorf("Failed to assert value from dataEngineUpgradeManager monitor: %v", data)
	} else {
		upgradeManager.Status.InstanceManagerImage = status.InstanceManagerImage
		upgradeManager.Status.State = status.State
		upgradeManager.Status.Message = status.Message
		upgradeManager.Status.UpgradingNode = status.UpgradingNode
		upgradeManager.Status.UpgradeNodes = make(map[string]*longhorn.UpgradeNodeStatus)
		for k, v := range status.UpgradeNodes {
			upgradeManager.Status.UpgradeNodes[k] = &longhorn.UpgradeNodeStatus{
				State:   v.State,
				Message: v.Message,
			}
		}
	}
}

⚠️ Potential issue

Handle type assertion failure when retrieving status data

In the reconcile function, if the type assertion fails (if !ok), the function logs an error but continues execution. This may lead to inconsistent state or further errors due to uninitialized fields.

Consider handling the error appropriately by returning an error or stopping further processing to prevent potential issues.

Apply this diff to return an error when the type assertion fails:

 func (uc *DataEngineUpgradeManagerController) reconcile(upgradeManagerName string) (err error) {
     // ...
     if uc.dataEngineUpgradeManagerMonitor != nil {
         data, _ := uc.dataEngineUpgradeManagerMonitor.GetCollectedData()
         status, ok := data.(*longhorn.DataEngineUpgradeManagerStatus)
         if !ok {
             log.Errorf("Failed to assert value from dataEngineUpgradeManager monitor: %v", data)
+            return fmt.Errorf("failed to retrieve valid DataEngineUpgradeManagerStatus")
         } else {
             upgradeManager.Status.InstanceManagerImage = status.InstanceManagerImage
             // existing code continues
         }
     }
     // ...
 }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +611 to +630
func (rc *ReplicaController) SuspendInstance(obj interface{}) error {
	return nil
}

func (rc *ReplicaController) ResumeInstance(obj interface{}) error {
	return nil
}

func (rc *ReplicaController) SwitchOverTarget(obj interface{}) error {
	return nil
}

func (rc *ReplicaController) DeleteTarget(obj interface{}) error {
	return nil
}

func (rc *ReplicaController) RequireRemoteTargetInstance(obj interface{}) (bool, error) {
	return false, nil
}


⚠️ Potential issue

Implement the logic for the newly added methods

The methods SuspendInstance, ResumeInstance, SwitchOverTarget, DeleteTarget, and RequireRemoteTargetInstance have been added but currently have empty implementations that return default values. If these methods are intended to handle critical functionality during live upgrades, their lack of implementation could lead to unintended behavior. Please implement the necessary logic or provide justifications for the empty implementations.

@derekbit derekbit force-pushed the v2-control-upgrade branch 3 times, most recently from f03bb50 to 7819f3a Compare November 25, 2024 14:47
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

🧹 Outside diff range and nitpick comments (29)
controller/upgrade_manager_controller.go (3)

185-201: Update OwnerID in a separate function for clarity

Updating the OwnerID and handling conflicts can be refactored into a separate function to improve readability and maintainability.

Consider extracting this logic into a new method:

func (uc *DataEngineUpgradeManagerController) updateOwnerID(upgradeManager *longhorn.DataEngineUpgradeManager, log *logrus.Entry) error {
    if upgradeManager.Status.OwnerID != uc.controllerID {
        upgradeManager.Status.OwnerID = uc.controllerID
        updatedUpgradeManager, err := uc.ds.UpdateDataEngineUpgradeManagerStatus(upgradeManager)
        if err != nil {
            if apierrors.IsConflict(errors.Cause(err)) {
                return nil
            }
            return err
        }
        *upgradeManager = *updatedUpgradeManager
        log.Infof("DataEngineUpgradeManager resource %v got new owner %v", upgradeManager.Name, uc.controllerID)
    }
    return nil
}

Then, in your reconcile method, replace the ownership update block with a call to this new function.


117-126: Optimize work item processing loop

In the worker function, consider adding a cancellation context or a mechanism to stop the goroutine more gracefully when the stop channel is closed.

This can help prevent potential goroutine leaks during shutdown.
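
A minimal sketch of the suggested context-aware worker loop is shown below. It is an illustration of the general pattern, not the controller's actual code: the loop exits when the queue is shut down, and the ctx check covers cancellation between items.

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// runWorker exits either when the queue is shut down (which unblocks Get) or
// when ctx is cancelled between items, so worker goroutines do not linger.
func runWorker(ctx context.Context, queue workqueue.RateLimitingInterface) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
		}

		key, quit := queue.Get()
		if quit {
			return
		}
		func() {
			defer queue.Done(key)
			fmt.Printf("processing %v\n", key)
			queue.Forget(key)
		}()
	}
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "demo")
	ctx, cancel := context.WithCancel(context.Background())

	go runWorker(ctx, queue)
	queue.Add("upgrade-manager-1")

	time.Sleep(100 * time.Millisecond)
	cancel()
	queue.ShutDown() // unblocks any worker waiting in Get
}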


183-184: Improve logging by including context

The logger initialized in the reconcile function could include additional context, such as the namespace or controller ID, for better traceability.

Consider updating the logger initialization:

 func (uc *DataEngineUpgradeManagerController) reconcile(upgradeManagerName string) (err error) {
     upgradeManager, err := uc.ds.GetDataEngineUpgradeManager(upgradeManagerName)
     // ...
-    log := getLoggerForDataEngineUpgradeManager(uc.logger, upgradeManager)
+    log := getLoggerForDataEngineUpgradeManager(uc.logger, upgradeManager).WithFields(
+        logrus.Fields{
+            "namespace":    uc.namespace,
+            "controllerID": uc.controllerID,
+        },
+    )
     // ...
 }
webhook/resources/volume/validator.go (2)

104-104: Clarify the error message for empty EngineImage

The error message could be more user-friendly. Instead of stating "BUG: Invalid empty Setting.EngineImage," consider rephrasing to guide the user on providing a valid engine image.

Apply this diff to improve the error message:

-return werror.NewInvalidError("BUG: Invalid empty Setting.EngineImage", "spec.image")
+return werror.NewInvalidError("spec.image must be specified and cannot be empty", "spec.image")

165-177: Simplify redundant checks for volume.Spec.NodeID

The condition if volume.Spec.NodeID != "" is checked twice within the nested if statements. This redundancy can be eliminated for clarity.

Apply this diff to remove the redundant check:

 if volume.Spec.NodeID != "" {
     node, err := v.ds.GetNodeRO(volume.Spec.NodeID)
     if err != nil {
         err = errors.Wrapf(err, "failed to get node %v", volume.Spec.NodeID)
         return werror.NewInternalError(err.Error())
     }

     if node.Spec.DataEngineUpgradeRequested {
-        if volume.Spec.NodeID != "" {
             return werror.NewInvalidError(fmt.Sprintf("volume %v is not allowed to attach to node %v during v2 data engine upgrade",
                 volume.Name, volume.Spec.NodeID), "spec.nodeID")
-        }
     }
 }
datastore/longhorn.go (2)

3995-3998: Refactor to eliminate redundant empty imageName checks

The check for an empty imageName in GetDataEngineImageCLIAPIVersion is duplicated for both data engine types. Consider consolidating this check to reduce code duplication and improve readability.

Apply this diff to refactor the function:

+    if imageName == "" {
+        return -1, fmt.Errorf("cannot check the CLI API Version based on empty image name")
+    }

     if types.IsDataEngineV2(dataEngine) {
-        if imageName == "" {
-            return -1, fmt.Errorf("cannot check the CLI API Version based on empty image name")
-        }
         return 0, nil
     }

-    if imageName == "" {
-        return -1, fmt.Errorf("cannot check the CLI API Version based on empty image name")
-    }

     ei, err := s.GetEngineImageRO(types.GetEngineImageChecksumName(imageName))
     if err != nil {
         return -1, errors.Wrapf(err, "failed to get engine image object based on image name %v", imageName)
     }

5641-5751: Add unit tests for DataEngineUpgradeManager methods

The new methods related to DataEngineUpgradeManager enhance upgrade functionality. To maintain code reliability and prevent regressions, please add unit tests covering these methods.

k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (1)

76-94: Consider adding type assertion error handling

While the implementation is correct, consider adding error handling for type assertions to improve robustness:

 func (s dataEngineUpgradeManagerNamespaceLister) List(selector labels.Selector) (ret []*v1beta2.DataEngineUpgradeManager, err error) {
 	err = cache.ListAllByNamespace(s.indexer, s.namespace, selector, func(m interface{}) {
-		ret = append(ret, m.(*v1beta2.DataEngineUpgradeManager))
+		if obj, ok := m.(*v1beta2.DataEngineUpgradeManager); ok {
+			ret = append(ret, obj)
+		}
 	})
 	return ret, err
 }

 func (s dataEngineUpgradeManagerNamespaceLister) Get(name string) (*v1beta2.DataEngineUpgradeManager, error) {
 	obj, exists, err := s.indexer.GetByKey(s.namespace + "/" + name)
 	if err != nil {
 		return nil, err
 	}
 	if !exists {
 		return nil, errors.NewNotFound(v1beta2.Resource("dataengineupgrademanager"), name)
 	}
-	return obj.(*v1beta2.DataEngineUpgradeManager), nil
+	manager, ok := obj.(*v1beta2.DataEngineUpgradeManager)
+	if !ok {
+		return nil, fmt.Errorf("cached object is not a *v1beta2.DataEngineUpgradeManager")
+	}
+	return manager, nil
 }
k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (2)

67-195: Consider enhancing error handling with wrapped errors.

While the implementation follows standard Kubernetes client patterns, consider wrapping errors with additional context to aid in debugging. For example:

 func (c *nodeDataEngineUpgrades) Get(ctx context.Context, name string, options v1.GetOptions) (result *v1beta2.NodeDataEngineUpgrade, err error) {
 	result = &v1beta2.NodeDataEngineUpgrade{}
 	err = c.client.Get().
 		Namespace(c.ns).
 		Resource("nodedataengineupgrades").
 		Name(name).
 		VersionedParams(&options, scheme.ParameterCodec).
 		Do(ctx).
 		Into(result)
+	if err != nil {
+		return nil, fmt.Errorf("failed to get NodeDataEngineUpgrade %s/%s: %w", c.ns, name, err)
+	}
 	return
 }

17-17: Note: This is an auto-generated file.

Any suggested changes would need to be made to the generation templates rather than directly to this file, as direct changes would be lost on the next code generation.

controller/node_upgrade_controller.go (3)

57-57: Address TODO comment regarding client wrapper

The TODO comment suggests there's technical debt related to client usage that needs to be addressed.

Would you like me to help track this by creating a GitHub issue to remove the event broadcaster wrapper once all clients have moved to use the clientset?


154-158: Add documentation for isResponsibleFor method

Consider adding a documentation comment explaining:

  • The purpose of the preferred owner ID
  • The relationship with the node ID
  • The conditions under which a controller is considered responsible

235-244: Improve monitor lifecycle management

The monitor cleanup logic is spread across multiple places and could lead to resource leaks. Consider:

  1. Extracting monitor cleanup into a separate method
  2. Adding error handling for Close() operations
  3. Ensuring cleanup happens in all error paths
+func (uc *NodeDataEngineUpgradeController) cleanupMonitor() {
+    if uc.nodeDataEngineUpgradeMonitor != nil {
+        if err := uc.nodeDataEngineUpgradeMonitor.Close(); err != nil {
+            uc.logger.WithError(err).Warn("Failed to close node data engine upgrade monitor")
+        }
+        uc.nodeDataEngineUpgradeMonitor = nil
+    }
+}

 if nodeUpgrade.Status.State == longhorn.UpgradeStateCompleted ||
     nodeUpgrade.Status.State == longhorn.UpgradeStateError {
     uc.updateNodeDataEngineUpgradeStatus(nodeUpgrade)
-    uc.nodeDataEngineUpgradeMonitor.Close()
-    uc.nodeDataEngineUpgradeMonitor = nil
+    uc.cleanupMonitor()
 }
controller/replica_controller.go (1)

528-528: Consider parameterizing the remote node flag.

The hardcoded false for isInstanceOnRemoteNode might need to be parameterized for consistency with other methods.

-im, err = rc.ds.GetInstanceManagerByInstance(obj, false)
+im, err = rc.ds.GetInstanceManagerByInstance(obj, isInstanceOnRemoteNode)
controller/monitor/node_upgrade_monitor.go (3)

24-24: Consider documenting the rationale for the sync period value.

Adding a comment explaining why 3 seconds was chosen as the sync period would help future maintainers understand the timing considerations.


108-137: Consider enhancing error handling in the run method.

The error from handleNodeUpgrade is not captured or logged. While the error is propagated through status updates, adding explicit error logging would help with debugging.

Apply this diff:

 func (m *NodeDataEngineUpgradeMonitor) run(value interface{}) error {
     nodeUpgrade, err := m.ds.GetNodeDataEngineUpgrade(m.nodeUpgradeName)
     if err != nil {
         return errors.Wrapf(err, "failed to get longhorn nodeDataEngineUpgrade %v", m.nodeUpgradeName)
     }

     existingNodeUpgradeStatus := m.nodeUpgradeStatus.DeepCopy()

-    m.handleNodeUpgrade(nodeUpgrade)
+    if err := m.handleNodeUpgrade(nodeUpgrade); err != nil {
+        m.logger.WithError(err).Errorf("Failed to handle node upgrade %v", m.nodeUpgradeName)
+    }

641-683: Consider optimizing node selection strategy.

The current implementation selects the first available node and then potentially updates it if a node with completed upgrade is found. This could be optimized to:

  1. Use a single pass through the nodes
  2. Consider additional factors like node capacity and current load
  3. Implement a more sophisticated scoring mechanism for node selection

Here's a suggested implementation:

 func (m *NodeDataEngineUpgradeMonitor) findAvailableNodeForTargetInstanceReplacement(nodeUpgrade *longhorn.NodeDataEngineUpgrade) (string, error) {
     upgradeManager, err := m.ds.GetDataEngineUpgradeManager(nodeUpgrade.Spec.DataEngineUpgradeManager)
     if err != nil {
         return "", err
     }

     ims, err := m.ds.ListInstanceManagersBySelectorRO("", "", longhorn.InstanceManagerTypeAllInOne, longhorn.DataEngineTypeV2)
     if err != nil {
         return "", err
     }

-    availableNode := ""
+    type nodeScore struct {
+        nodeID string
+        score  int
+    }
+    var bestNode nodeScore

     for _, im := range ims {
         if im.Status.CurrentState != longhorn.InstanceManagerStateRunning {
             continue
         }

         if im.Spec.NodeID == nodeUpgrade.Status.OwnerID {
             continue
         }

-        if availableNode == "" {
-            availableNode = im.Spec.NodeID
+        score := 0
+        
+        // Base score for running instance manager
+        score += 1
+
+        node, err := m.ds.GetNode(im.Spec.NodeID)
+        if err != nil {
+            continue
+        }
+
+        // Prefer nodes with more available resources
+        if condition := types.GetCondition(node.Status.Conditions, longhorn.NodeConditionTypeSchedulable); condition.Status == longhorn.ConditionStatusTrue {
+            score += 2
         }

         upgradeNodeStatus, ok := upgradeManager.Status.UpgradeNodes[im.Spec.NodeID]
-        if !ok {
-            continue
+        if ok && upgradeNodeStatus.State == longhorn.UpgradeStateCompleted {
+            score += 4
         }

-        // Prefer the node that has completed the upgrade
-        if upgradeNodeStatus.State == longhorn.UpgradeStateCompleted {
-            availableNode = im.Spec.NodeID
-            break
+        if score > bestNode.score {
+            bestNode = nodeScore{
+                nodeID: im.Spec.NodeID,
+                score:  score,
+            }
         }
     }

-    if availableNode == "" {
+    if bestNode.nodeID == "" {
         return "", fmt.Errorf("failed to find available node for target")
     }

-    return availableNode, nil
+    return bestNode.nodeID, nil
 }
engineapi/instance_manager.go (1)

532-555: Enhance error handling and add validation in getReplicaAddresses

While the core logic is sound, consider the following improvements:

  1. Make error messages more descriptive by including the problematic address
  2. Add validation for empty input addresses
  3. Document or handle the edge case when all replicas are filtered out
 func getReplicaAddresses(replicaAddresses map[string]string, initiatorAddress, targetAddress string) (map[string]string, error) {
+	if initiatorAddress == "" || targetAddress == "" {
+		return nil, errors.New("initiator and target addresses are required")
+	}
+
 	initiatorIP, _, err := net.SplitHostPort(initiatorAddress)
 	if err != nil {
-		return nil, errors.New("invalid initiator address format")
+		return nil, errors.Errorf("invalid initiator address format: %v", initiatorAddress)
 	}
 
 	targetIP, _, err := net.SplitHostPort(targetAddress)
 	if err != nil {
-		return nil, errors.New("invalid target address format")
+		return nil, errors.Errorf("invalid target address format: %v", targetAddress)
 	}
 
 	addresses := make(map[string]string)
 	for name, addr := range replicaAddresses {
 		replicaIP, _, err := net.SplitHostPort(addr)
 		if err != nil {
-			return nil, errors.New("invalid replica address format")
+			return nil, errors.Errorf("invalid replica address format for %v: %v", name, addr)
 		}
 		if initiatorIP != targetIP && initiatorIP == replicaIP {
 			continue
 		}
 		addresses[name] = addr
 	}
+	if len(addresses) == 0 {
+		return nil, errors.New("no valid replica addresses found after filtering")
+	}
 	return addresses, nil
 }
controller/backup_controller.go (1)

599-607: LGTM! Consider adding a comment explaining the rationale.

The code correctly prevents nodes from being responsible for backups when they have a data engine upgrade requested, which is essential for safe live upgrades.

Consider adding a comment explaining why we skip backup responsibility during data engine upgrades:

+	// Skip backup responsibility when a data engine upgrade is requested
+	// to prevent potential issues during the upgrade process
 	if node.Spec.DataEngineUpgradeRequested {
 		return false, nil
 	}
controller/instance_handler.go (3)

58-165: Consider refactoring the status sync method

The syncStatusIPsAndPorts method is quite long and handles multiple responsibilities. Consider breaking it down into smaller, focused methods:

  • syncInitiatorStatus
  • syncTargetStatus
  • syncStorageIPs

This would improve readability and maintainability.
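
The sketch below shows the shape of that decomposition: a thin orchestrator delegating to the three focused helpers named above. The status fields and helper bodies are placeholders, not the real instance handler logic.

package main

import "fmt"

// instanceStatus holds only the fields needed to illustrate the split.
type instanceStatus struct {
	IP, TargetIP               string
	Port, TargetPort           int
	StorageIP, StorageTargetIP string
}

// syncStatusIPsAndPorts becomes a thin orchestrator after the refactor.
func syncStatusIPsAndPorts(status *instanceStatus, imIP, targetIMIP string, port, targetPort int) {
	syncInitiatorStatus(status, imIP, port)
	syncTargetStatus(status, targetIMIP, targetPort)
	syncStorageIPs(status)
}

func syncInitiatorStatus(status *instanceStatus, ip string, port int) {
	status.IP, status.Port = ip, port
}

func syncTargetStatus(status *instanceStatus, ip string, port int) {
	status.TargetIP, status.TargetPort = ip, port
}

func syncStorageIPs(status *instanceStatus) {
	// Placeholder: in the real controller these come from the storage network.
	status.StorageIP, status.StorageTargetIP = status.IP, status.TargetIP
}

func main() {
	s := &instanceStatus{}
	syncStatusIPsAndPorts(s, "10.0.0.1", "10.0.0.2", 10000, 20000)
	fmt.Printf("%+v\n", *s)
}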


716-790: Enhance error handling in instance creation

The instance creation logic handles both v1 and v2 data engines well, but consider:

  1. Adding more context to error messages
  2. Using structured logging with fields
  3. Adding metrics for instance creation success/failure
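
The sketch below illustrates the first two suggestions (wrapped errors and structured logging); the createFn parameter and field names are assumptions standing in for the real instance-manager call, and metrics are left out.

package main

import (
	"fmt"

	"github.com/pkg/errors"
	"github.com/sirupsen/logrus"
)

// createInstance wraps failures with the instance context and logs with
// structured fields, as suggested above.
func createInstance(log logrus.FieldLogger, name, dataEngine string, isRemoteTarget bool,
	createFn func() error) error {

	log = log.WithFields(logrus.Fields{
		"instance":       name,
		"dataEngine":     dataEngine,
		"isRemoteTarget": isRemoteTarget,
	})

	if err := createFn(); err != nil {
		log.WithError(err).Error("Failed to create instance")
		return errors.Wrapf(err, "failed to create instance %v (dataEngine=%v, remoteTarget=%v)",
			name, dataEngine, isRemoteTarget)
	}

	log.Info("Created instance")
	return nil
}

func main() {
	err := createInstance(logrus.StandardLogger(), "engine-1", "v2", true, func() error {
		return fmt.Errorf("connection refused")
	})
	fmt.Println(err)
}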

883-995: Add documentation for v2 data engine helper methods

The new helper methods lack documentation explaining their purpose and behavior. Consider adding detailed comments for:

  • isVolumeBeingSwitchedBack
  • isTargetInstanceReplacementCreated
  • isTargetInstanceRemote
  • isDataEngineNotBeingLiveUpgraded

This will help other developers understand the v2 data engine upgrade flow.

controller/node_controller.go (1)

Line range hint 2177-2209: LGTM! Consider adding error handling for condition updates.

The implementation for handling data engine upgrades looks good. The code properly disables scheduling during upgrades and provides clear status messages.

Consider adding error handling for the SetConditionAndRecord calls to handle potential errors during condition updates. For example:

-		node.Status.Conditions =
-			types.SetConditionAndRecord(node.Status.Conditions,
+		conditions, err := types.SetConditionAndRecord(node.Status.Conditions,
 				longhorn.NodeConditionTypeSchedulable,
 				longhorn.ConditionStatusFalse,
 				reason,
 				message,
 				nc.eventRecorder,
 				node,
 				corev1.EventTypeNormal)
+		if err != nil {
+			return err
+		}
+		node.Status.Conditions = conditions
controller/engine_controller.go (2)

437-467: LGTM with a minor suggestion for error handling improvement

The method effectively handles instance manager and IP resolution for both initiator and target instances. Consider adding more specific error messages for common failure cases.

 func (ec *EngineController) findInstanceManagerAndIPs(obj interface{}) (im *longhorn.InstanceManager, initiatorIP string, targetIP string, err error) {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
-        return nil, "", "", fmt.Errorf("invalid object for engine: %v", obj)
+        return nil, "", "", fmt.Errorf("expected Engine type but got %T", obj)
     }

     initiatorIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, false)
     if err != nil {
-        return nil, "", "", err
+        return nil, "", "", errors.Wrapf(err, "failed to get initiator instance manager for engine %v", e.Name)
     }

704-760: LGTM with suggestion for improved modularity

The method effectively handles target switchover with proper validation and logging. Consider breaking down the port handling logic into a separate helper method for better maintainability.

Consider extracting port handling logic:

+func (ec *EngineController) getTargetPort(targetInstance *longhorn.InstanceProcess) int {
+    port := targetInstance.Status.TargetPortStart
+    if targetInstance.Status.StandbyTargetPortStart != 0 {
+        port = targetInstance.Status.StandbyTargetPortStart
+    }
+    return port
+}

 func (ec *EngineController) SwitchOverTarget(obj interface{}) error {
     // ... existing validation code ...
     
-    port := targetInstance.Status.TargetPortStart
-    if targetInstance.Status.StandbyTargetPortStart != 0 {
-        port = targetInstance.Status.StandbyTargetPortStart
-    }
+    port := ec.getTargetPort(targetInstance)
     
     log.Infof("Switching over target to %v:%v", targetIM.Status.IP, port)
     // ... rest of the code ...
k8s/crds.yaml (2)

1313-1414: Enhance field descriptions in DataEngineUpgradeManager CRD

The DataEngineUpgradeManager CRD structure is good, but some fields would benefit from more detailed descriptions:

  1. spec.nodes could clarify the behavior when nodes are added/removed during upgrade
  2. status.upgradeNodes could explain the state transitions
  3. status.instanceManagerImage is missing a description
              nodes:
                description: |-
                  Nodes specifies the list of nodes to perform the data engine upgrade on.
                  If empty, the upgrade will be performed on all available nodes.
+                 Adding or removing nodes during an upgrade may affect the upgrade process.
                items:
                  type: string
                type: array
              instanceManagerImage:
+               description: The instance manager image used for the data engine upgrade.
                type: string
              upgradeNodes:
                additionalProperties:
                  description: |-
                    UpgradeState defines the state of the node upgrade process
+                   States can transition from "pending" -> "in-progress" -> "completed"/"failed"

2581-2583: Enhance documentation for Node upgrade field

The dataEngineUpgradeRequested field description could be more detailed to help operators understand its implications.

              dataEngineUpgradeRequested:
-               description: Request to upgrade the instance manager for v2 volumes on the node.
+               description: |-
+                 Request to upgrade the instance manager for v2 volumes on the node.
+                 When set to true, the node will be scheduled for data engine upgrade.
+                 The upgrade process will only proceed if there are no ongoing volume operations.
+                 This field should not be modified while an upgrade is in progress.
                type: boolean
controller/volume_controller.go (2)

1923-1930: Add error handling for engine state transitions

The code correctly handles engine state transitions for v2 data engine by checking both image and target node ID. However, consider adding error handling for unexpected states.

 if types.IsDataEngineV1(v.Spec.DataEngine) {
   e.Spec.DesireState = longhorn.InstanceStateRunning
 } else {
   if v.Spec.Image == v.Status.CurrentImage && 
      v.Spec.TargetNodeID == v.Status.CurrentTargetNodeID {
     e.Spec.DesireState = longhorn.InstanceStateRunning
+  } else {
+    log.Debugf("Waiting for image/target node sync before setting engine running state")
   }
 }

1619-1619: Enhance error logging for volume dependent resources

The warning log could be more descriptive to help with troubleshooting.

-log.WithField("e.Status.CurrentState", e.Status.CurrentState).Warn("Volume is attached but dependent resources are not opened")
+log.WithFields(logrus.Fields{
+  "engineState": e.Status.CurrentState,
+  "volumeState": v.Status.State,
+}).Warn("Volume is attached but engine or replica resources are not in running state")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 00e7be5 and 7819f3a.

📒 Files selected for processing (50)
  • controller/backup_controller.go (1 hunks)
  • controller/controller_manager.go (2 hunks)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
  • controller/monitor/upgrade_manager_monitor.go (1 hunks)
  • controller/node_controller.go (2 hunks)
  • controller/node_upgrade_controller.go (1 hunks)
  • controller/replica_controller.go (5 hunks)
  • controller/uninstall_controller.go (4 hunks)
  • controller/upgrade_manager_controller.go (1 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
  • datastore/datastore.go (3 hunks)
  • datastore/longhorn.go (6 hunks)
  • engineapi/instance_manager.go (5 hunks)
  • engineapi/instance_manager_test.go (1 hunks)
  • k8s/crds.yaml (84 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (6 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/node.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/register.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/volume.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (4 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go (3 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/generic.go (2 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go (4 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go (2 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • scheduler/replica_scheduler.go (1 hunks)
  • types/types.go (4 hunks)
  • webhook/resources/dataengineupgrademanager/mutator.go (1 hunks)
  • webhook/resources/dataengineupgrademanager/validator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/mutator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/validator.go (1 hunks)
  • webhook/resources/volume/validator.go (5 hunks)
  • webhook/server/mutation.go (2 hunks)
  • webhook/server/validation.go (2 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
✅ Files skipped from review due to trivial changes (2)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go
  • k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go
🚧 Files skipped from review as they are similar to previous changes (25)
  • controller/controller_manager.go
  • controller/monitor/upgrade_manager_monitor.go
  • controller/uninstall_controller.go
  • engineapi/instance_manager_test.go
  • k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go
  • k8s/pkg/apis/longhorn/v1beta2/node.go
  • k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go
  • k8s/pkg/apis/longhorn/v1beta2/register.go
  • k8s/pkg/apis/longhorn/v1beta2/volume.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go
  • k8s/pkg/client/informers/externalversions/generic.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go
  • k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go
  • scheduler/replica_scheduler.go
  • webhook/resources/dataengineupgrademanager/mutator.go
  • webhook/resources/dataengineupgrademanager/validator.go
  • webhook/resources/nodedataengineupgrade/mutator.go
  • webhook/server/mutation.go
  • webhook/server/validation.go
🧰 Additional context used
📓 Learnings (2)
controller/engine_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/engine_controller.go:524-527
Timestamp: 2024-11-25T12:39:58.926Z
Learning: In `controller/engine_controller.go`, `e.Status.Port` is sourced from the SPDK engine and does not require additional validation.
controller/monitor/node_upgrade_monitor.go (1)
Learnt from: james-munson
PR: longhorn/longhorn-manager#3211
File: app/post_upgrade.go:102-113
Timestamp: 2024-11-10T16:45:04.898Z
Learning: In Go, when a deferred function references a variable like `err`, ensure that the variable is declared in the outer scope and not within an inner scope (such as within `if err := ...`), to prevent compilation errors and unintended variable shadowing.
🪛 golangci-lint (1.61.0)
controller/volume_controller.go

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (75)
webhook/resources/nodedataengineupgrade/validator.go (2)

40-66: Validation logic in the Create method is appropriate

The Create method correctly validates all required fields for the NodeDataEngineUpgrade resource and provides explicit error messages for invalid input.


68-95: Proper enforcement of immutability in the Update method

The Update method effectively ensures that critical fields (NodeID, DataEngine, InstanceManagerImage, and DataEngineUpgradeManager) remain immutable during updates, maintaining the integrity of the resource.

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go (14)

17-17: Acknowledgment: Code is auto-generated by client-gen

As this file is generated by client-gen, it should not be manually edited to prevent inconsistencies.


33-37: Definition of DataEngineUpgradeManagersGetter interface is appropriate

The getter interface is correctly defined, following standard client-go patterns.


39-51: DataEngineUpgradeManagerInterface includes all necessary CRUD methods

All essential methods for managing DataEngineUpgradeManager resources are appropriately declared.


53-57: dataEngineUpgradeManagers struct encapsulates client and namespace effectively

The struct correctly holds the REST client and namespace for resource operations.


59-65: Constructor newDataEngineUpgradeManagers initializes the client properly

The function correctly returns a new instance with the REST client and specified namespace.


67-78: Get method retrieves resources accurately

The implementation properly fetches a single DataEngineUpgradeManager resource by name.


80-95: List method handles resource listing with timeout considerations

The method correctly lists resources based on label and field selectors, handling timeouts if specified.


97-110: Watch method establishes resource watches effectively

The implementation allows for watching changes to DataEngineUpgradeManager resources, supporting timeouts.


112-123: Create method correctly handles resource creation

The method accurately creates a new DataEngineUpgradeManager resource with the provided specifications.


125-137: Update method properly updates existing resources

Resource updates are correctly handled, ensuring the server-side representation is returned.


139-153: UpdateStatus method manages status subresource updates correctly

The implementation appropriately updates the status of a DataEngineUpgradeManager resource.


155-164: Delete method removes resources as expected

The method correctly deletes a specified resource by name.


166-180: DeleteCollection method handles bulk deletion effectively

The implementation allows for deleting a collection of resources, considering timeout settings.


182-195: Patch method applies modifications to resources accurately

The method correctly applies patches to resources, supporting various patch types and subresources.

controller/upgrade_manager_controller.go (5)

211-214: Handle terminal states appropriately

When the upgrade manager reaches a terminal state (UpgradeStateCompleted or UpgradeStateError), the controller stops processing. Ensure that any necessary cleanup or final actions are performed before exiting.

Confirm that all resources are properly finalized and that no further action is required.
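
A minimal sketch of what that could look like in reconcile, assuming a hypothetical helper that stops the monitor (field and state names follow the review text):

if upgradeManager.Status.State == longhorn.UpgradeStateCompleted ||
    upgradeManager.Status.State == longhorn.UpgradeStateError {
    // Stop and release the monitor before returning; helper name is hypothetical.
    uc.closeDataEngineUpgradeManagerMonitor()
    return nil
}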


74-84: ⚠️ Potential issue

Add error handling for event handler registration

When adding the event handler with ds.DataEngineUpgradeManagerInformer.AddEventHandler, the returned error should be handled to catch any issues during the handler setup.

Apply this diff to handle the error appropriately:

 if _, err := ds.DataEngineUpgradeManagerInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
     AddFunc:    uc.enqueueDataEngineUpgradeManager,
     UpdateFunc: func(old, cur interface{}) { uc.enqueueDataEngineUpgradeManager(cur) },
     DeleteFunc: uc.enqueueDataEngineUpgradeManager,
 }); err != nil {
+    // Consider logging the error here if necessary
     return nil, err
 }

Ensure that any setup errors are not silently ignored.

Likely invalid or redundant comment.


235-251: ⚠️ Potential issue

Handle type assertion failure appropriately

In the reconcile function, if the type assertion fails (if !ok), the code logs an error but continues execution. This may lead to a nil pointer dereference or inconsistent state.

Apply this diff to return an error when the type assertion fails:

 func (uc *DataEngineUpgradeManagerController) reconcile(upgradeManagerName string) (err error) {
     // ...
     if uc.dataEngineUpgradeManagerMonitor != nil {
         data, _ := uc.dataEngineUpgradeManagerMonitor.GetCollectedData()
         status, ok := data.(*longhorn.DataEngineUpgradeManagerStatus)
         if !ok {
             log.Errorf("Failed to assert value from dataEngineUpgradeManager monitor: %v", data)
+            return fmt.Errorf("failed to retrieve valid DataEngineUpgradeManagerStatus")
         } else {
             upgradeManager.Status.InstanceManagerImage = status.InstanceManagerImage
             // existing code continues
         }
     }
     // ...
 }

Likely invalid or redundant comment.


263-275: ⚠️ Potential issue

Potential race condition accessing uc.dataEngineUpgradeManagerMonitor

Access to uc.dataEngineUpgradeManagerMonitor is not synchronized. If the controller runs multiple workers, this could lead to race conditions.

Apply this diff to add a mutex for synchronization:

+import "sync"

 type DataEngineUpgradeManagerController struct {
     // ...
     dataEngineUpgradeManagerMonitor monitor.Monitor
+    monitorMutex                    sync.Mutex
     // ...
 }

 func (uc *DataEngineUpgradeManagerController) createDataEngineUpgradeManagerMonitor(upgradeManager *longhorn.DataEngineUpgradeManager) (monitor.Monitor, error) {
+    uc.monitorMutex.Lock()
+    defer uc.monitorMutex.Unlock()

     if uc.dataEngineUpgradeManagerMonitor != nil {
         return uc.dataEngineUpgradeManagerMonitor, nil
     }
     // ...
 }

Likely invalid or redundant comment.


218-227: Handle status update conflicts properly to prevent requeue loops

In the deferred function, when updating the status, if there is a conflict, you re-enqueue the item. Ensure that this does not lead to infinite requeues and that the controller respects the maxRetries limit.

Review the conflict handling logic to confirm it aligns with the desired retry mechanism. Run the following script to verify the requeue behavior:

✅ Verification successful

The conflict handling is properly implemented with retry limits

The verification shows that:

  1. The controller has a global maxRetries = 3 limit defined in base_controller.go
  2. When conflicts occur during status updates, the item is requeued using enqueueDataEngineUpgradeManager which adds to the normal queue (not rate-limited)
  3. The base error handling checks NumRequeues(key) < maxRetries and uses rate-limited requeues for general reconciliation errors
  4. Status update conflicts use a separate, more lenient requeue mechanism which is appropriate since conflicts are temporary race conditions
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that items are not endlessly requeued due to conflicts.

# Expected: Items should be requeued up to maxRetries times.

grep -A 10 'func (uc *DataEngineUpgradeManagerController) handleErr' controller/upgrade_manager_controller.go

Length of output: 111


Script:

#!/bin/bash
# Let's search for retry-related code in the controller
rg -A 5 -B 5 'maxRetries|requeue|Requeue' controller/upgrade_manager_controller.go

Length of output: 768


Script:

#!/bin/bash
# Let's also check the controller struct definition for retry-related fields
ast-grep --pattern 'type DataEngineUpgradeManagerController struct {
  $$$
}'

Length of output: 1207


Script:

#!/bin/bash
# Let's check the reconcile function to understand the error handling flow
rg -A 10 'func \(uc \*DataEngineUpgradeManagerController\) reconcile' controller/upgrade_manager_controller.go

Length of output: 446


Script:

#!/bin/bash
# Let's check the baseController definition since the controller embeds it
rg -A 10 'type baseController struct' .

Length of output: 821


Script:

#!/bin/bash
# Let's also check where maxRetries is defined
rg 'maxRetries.*=' .

Length of output: 198


Script:

#!/bin/bash
# Let's check the enqueue function implementation
ast-grep --pattern 'func (uc *DataEngineUpgradeManagerController) enqueueDataEngineUpgradeManager($$$)'

Length of output: 774

webhook/resources/volume/validator.go (13)

131-131: Ensure Data Engine is enabled before creation

Validating that the required data engine is enabled is crucial. Confirm that ValidateRequiredDataEngineEnabled correctly checks the data engine status.


144-145: Verify engine image compatibility

The check for engine image compatibility is important. Ensure that CheckDataEngineImageCompatiblityByImage accurately validates the image against the specified data engine.


147-148: Restrict setting TargetNodeID during volume creation

It's appropriate to prevent setting spec.targetNodeID for a new volume. This ensures that the target node is determined during attachment.


154-156: Feature not supported: Encrypted volumes with Data Engine v2

Encrypted volumes are not supported for Data Engine v2. The validation correctly prevents this configuration.


158-160: Feature not supported: Backing images with Data Engine v2

Backing images are not supported for Data Engine v2. The validation ensures users are aware of this limitation.


162-164: Feature not supported: Clone operations with Data Engine v2

Cloning from another volume is not supported for Data Engine v2. Validation here is appropriate.


271-272: Undefined variable v in error message

The error message references v.Spec.MigrationNodeID, but v is not defined in this context. It should likely be newVolume.Spec.MigrationNodeID.


275-277: Validate SnapshotMaxCount within acceptable range

Good job validating that snapshotMaxCount is within the acceptable range. This prevents potential issues with snapshot management.


284-288: Ensure safe updating of snapshot limits

Validating changes to SnapshotMaxCount and SnapshotMaxSize helps prevent configurations that could inadvertently delete existing snapshots.


298-305: Prevent unsupported changes to BackingImage for Data Engine v2

Changing the BackingImage is not supported for volumes using Data Engine v2. The validation correctly enforces this restriction.


356-360: Handle errors when retrieving node information

Proper error handling when retrieving the node ensures that unexpected issues are surfaced appropriately.


368-369: Logical issue in condition for changing TargetNodeID

The condition checks if oldVolume.Spec.TargetNodeID == "" && oldVolume.Spec.TargetNodeID != newVolume.Spec.TargetNodeID. Since oldVolume.Spec.TargetNodeID is "", the second part oldVolume.Spec.TargetNodeID != newVolume.Spec.TargetNodeID will always be true if newVolume.Spec.TargetNodeID is not empty.

Consider revising the condition for clarity.
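
One way to state the intended check more directly (a sketch, assuming the intent is "a target node is being set for the first time"):

// The first clause already guarantees the old value differs from any non-empty new value.
settingNewTarget := oldVolume.Spec.TargetNodeID == "" && newVolume.Spec.TargetNodeID != ""
if settingNewTarget {
    // validate whether a target node may be assigned to this volume
}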


407-408: Restrict setting TargetNodeID for non-Data Engine v2 volumes

It's appropriate to prevent setting spec.targetNodeID when the volume is not using Data Engine v2, ensuring consistent behavior.

datastore/longhorn.go (3)

Line range hint 1801-1805: Ensure proper error handling when InstanceManagerName is empty

In the getRunningReplicaInstanceManagerRO function, when r.Status.InstanceManagerName is empty, the code calls GetInstanceManagerByInstanceRO. Please verify that GetInstanceManagerByInstanceRO can handle cases where InstanceManagerName is not set, to prevent potential nil pointer dereferences.


3761-3764: Verify assignment of nodeID when isInstanceOnRemoteNode is true

In GetInstanceManagerByInstanceRO, when isInstanceOnRemoteNode is true, nodeID is assigned from obj.Spec.TargetNodeID. Ensure that TargetNodeID is always properly set to avoid potential issues with empty or invalid nodeID values.


5848-5875: ⚠️ Potential issue

Handle errors in RemoveFinalizerForNodeDataEngineUpgrade

In RemoveFinalizerForNodeDataEngineUpgrade, if the Update call fails and DeletionTimestamp is not set, the function does not return the error. This could lead to silent failures. Ensure that errors are properly returned and logged.

Apply this diff to handle the error appropriately:

 _, err := s.lhClient.LonghornV1beta2().NodeDataEngineUpgrades(s.namespace).Update(context.TODO(), upgrade, metav1.UpdateOptions{})
 if err != nil {
     // workaround `StorageError: invalid object, Code: 4` due to empty object
     if upgrade.DeletionTimestamp != nil {
         return nil
     }
-    return nil
+    return errors.Wrapf(err, "unable to remove finalizer for nodeDataEngineUpgrade %s", upgrade.Name)
 }

Likely invalid or redundant comment.

k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (2)

28-53: LGTM! Follows Kubernetes lister patterns

The implementation correctly follows Kubernetes lister patterns:

  • Proper interface definition with List and namespace-specific methods
  • Efficient use of cache.Indexer
  • Correct handling of label selectors

17-26: Generated code verification required

This is an auto-generated file. Please ensure:

  1. The code generator configuration is correct
  2. The generator version is up-to-date with your kubernetes dependencies
✅ Verification successful

Code generation configuration is correct

The verification shows that:

  1. The package has proper code generation directives in doc.go (+k8s:deepcopy-gen=package and +groupName=longhorn.io)
  2. The generated lister follows the standard Kubernetes pattern with correct interface definition
  3. The generated code is consistent with other Kubernetes-style listers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the code generator configuration
find . -name "*.go" -type f -exec grep -l "// +k8s:deepcopy-gen" {} \;

Length of output: 12593


Script:

#!/bin/bash
# Let's verify the code generation markers and configuration for the DataEngineUpgradeManager
rg -A 5 "// +k8s:.*-gen" ./k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go

# Also check if there are any generator tags in doc.go which often contains package-level generation directives
cat ./k8s/pkg/apis/longhorn/v1beta2/doc.go

# Check if the generated lister follows the correct pattern
ast-grep --pattern 'type DataEngineUpgradeManagerLister interface {
  $$$
}'

Length of output: 1252

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (3)

1-31: LGTM: File header and imports are properly structured.

The file follows standard Kubernetes client code conventions with appropriate license header and necessary imports.


53-65: LGTM: Implementation struct and constructor are properly implemented.

The client implementation follows standard Kubernetes patterns with proper initialization.


33-51: Verify interface completeness against CRD definition.

The interfaces follow standard Kubernetes patterns and include all expected CRUD operations. However, let's verify that all custom resource operations defined in the CRD are covered.

✅ Verification successful

Based on the shell script results, I can now generate the final response:

Interface implementation is complete and matches CRD definition

The NodeDataEngineUpgrade interfaces are correctly implemented with all necessary operations. The CRD definition in k8s/crds.yaml shows:

  • Basic resource definition with standard CRUD operations
  • Status subresource support (covered by UpdateStatus method)
  • No additional custom actions or subresources defined

The interface provides all required methods for interacting with the NodeDataEngineUpgrade custom resource, including status updates.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all custom resource operations defined in the CRD are covered by the interface
# Search for CRD definition and additional subresources or custom actions

# Look for CRD definition
fd -e yaml -e yml | xargs rg -l "kind:\s*CustomResourceDefinition" | xargs rg "NodeDataEngineUpgrade" -A 10

# Look for any custom action definitions in controller
rg "NodeDataEngineUpgrade.*Action|Action.*NodeDataEngineUpgrade" -g "*.go"

Length of output: 4275

controller/node_upgrade_controller.go (1)

86-126: LGTM: Queue management implementation follows best practices

The queue management and worker implementation follows Kubernetes controller patterns correctly, with proper error handling and graceful shutdown.

datastore/datastore.go (4)

93-96: LGTM: New fields follow established patterns

The new fields for data engine upgrade management follow the consistent pattern used for other resources in the DataStore struct.


186-189: LGTM: Proper initialization of new informers

The initialization of the new informers follows the established pattern:

  1. Informer creation from the factory
  2. Cache sync registration

279-282: LGTM: Correct lister and informer assignments

The assignments of the new listers and informers in the DataStore constructor follow the consistent pattern used throughout the codebase.


48-92: Verify impact of field alignment changes

While the alignment changes improve readability, we should verify that all fields are preserved during the reformatting.

Also applies to: 234-278

✅ Verification successful

Field alignment changes verified successfully

Comparing the old and new versions of the DataStore struct shows that all fields are preserved with identical types and ordering. The alignment changes were purely cosmetic and did not affect the structure's content.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify no fields were accidentally removed during reformatting
# by comparing the old version with the new version

# Get the old version of the file
git show HEAD^:datastore/datastore.go > old_datastore.go

# Extract struct fields from both versions
echo "=== Old version fields ==="
awk '/type DataStore struct/,/}/' old_datastore.go | grep -v '^[[:space:]]*$' | grep -v '^}'
echo "=== New version fields ==="
awk '/type DataStore struct/,/}/' datastore/datastore.go | grep -v '^[[:space:]]*$' | grep -v '^}'

# Cleanup
rm old_datastore.go

Length of output: 10425

controller/instance_handler_test.go (3)

Line range hint 39-51: LGTM: Method signature updates are consistent with interface changes.

The updated signatures for GetInstance and CreateInstance correctly include the new isInstanceOnRemoteNode parameter, maintaining consistency with the interface changes.


137-141: LGTM: Engine struct field additions are properly implemented.

The new fields TargetIP, StorageIP, StorageTargetIP, and TargetPort are correctly added to the newEngine function, maintaining consistency with the engine status structure.


87-90: ⚠️ Potential issue

Inconsistent error handling in DeleteTarget method.

While other unimplemented methods in MockInstanceManagerHandler return an error to indicate they are not mocked, DeleteTarget currently returns nil.

Apply this diff to ensure consistent error handling:

func (imh *MockInstanceManagerHandler) DeleteTarget(obj interface{}) error {
    // DeleteTarget is not mocked
-   return nil
+   return fmt.Errorf("DeleteTarget is not mocked")
}

Likely invalid or redundant comment.

controller/replica_controller.go (3)

Line range hint 636-647: LGTM! Implementation is consistent.

The changes maintain consistency with CreateInstance for remote node handling and properly validate V2 data engine state.


611-630: Empty implementations need to be addressed.

These empty implementations were previously flagged in the review. The comment remains valid.


Line range hint 316-355: LGTM! Verify remote node instance handling.

The addition of isInstanceOnRemoteNode parameter enhances the instance creation logic to support remote nodes, which is essential for live upgrades.

Let's verify the remote node instance handling implementation:

✅ Verification successful

Remote node instance handling implementation is consistent and complete

The implementation of isInstanceOnRemoteNode is properly integrated across the codebase:

  • Interface definition in InstanceManagerHandler correctly includes the parameter
  • Consistent implementation in both Engine and Replica controllers
  • DataStore layer properly handles the parameter by switching between NodeID and TargetNodeID
  • Test coverage is in place with mock implementations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent remote node instance handling across the codebase

# Test: Search for other remote node instance handling patterns
rg -l "isInstanceOnRemoteNode"

# Test: Check for potential instance manager retrieval patterns
ast-grep --pattern 'GetInstanceManagerByInstance($$$)'

Length of output: 241


Script:

#!/bin/bash
# Let's examine the implementation details of remote node instance handling

# Check the implementation in instance_handler.go
rg "isInstanceOnRemoteNode" controller/instance_handler.go -B 2 -A 2

# Check how engine controller implements this
rg "isInstanceOnRemoteNode" controller/engine_controller.go -B 2 -A 2

# Check the datastore implementation
rg "isInstanceOnRemoteNode" datastore/longhorn.go -B 2 -A 2

# Look for any test cases
rg "isInstanceOnRemoteNode" controller/instance_handler_test.go -B 2 -A 2

Length of output: 2346

engineapi/instance_manager.go (5)

283-292: LGTM: Instance process status fields properly extended

The addition of StandbyTargetPortStart and StandbyTargetPortEnd fields enhances the instance process status to support v2 data engine live upgrades.


885-885: Fix incorrect function comment

The function comment incorrectly states that it "suspends engine instance" when it should be "resumes engine instance".


904-914: Add input validation for TargetAddress

The EngineInstanceSwitchOverTarget method should validate the TargetAddress before use.
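
For example, a guard along these lines before building the request (a sketch; the req.TargetAddress field name follows the review text):

host, port, err := net.SplitHostPort(req.TargetAddress)
if err != nil {
    return errors.Wrapf(err, "invalid target address %q for target switch-over", req.TargetAddress)
}
if host == "" || port == "" {
    return fmt.Errorf("invalid target address %q: host and port must both be set", req.TargetAddress)
}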


864-931: Add consistent nil checks across new engine instance methods

Each new method should validate that the Engine field in the request is not nil before use, similar to existing checks in other methods.
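
A minimal guard each method could start with (sketch; the request variable name is assumed):

if req == nil || req.Engine == nil {
    return fmt.Errorf("BUG: missing engine in request for engine instance operation")
}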


873-873: Fix typo in error messages: "date engine" should be "data engine"

Multiple error messages contain the same typo.

Also applies to: 890-890, 908-908, 925-925

controller/instance_handler.go (2)

38-46: LGTM: Interface changes are well-structured

The new methods added to InstanceManagerHandler interface provide a clear and consistent API for managing v2 data engine instances.


Line range hint 38-995: Verify v2 data engine upgrade implementation

The changes implement significant functionality for v2 data engine live upgrades. Let's verify the implementation:

✅ Verification successful

Based on the results, let me verify the v2 data engine upgrade implementation:

  1. The code shows proper handling of target instance replacement and cleanup:

    • Tracks target instance state via TargetInstanceReplacementCreated and CurrentTargetNodeID
    • Handles cleanup by resetting these fields when instance is stopped
    • Properly manages target instance lifecycle during upgrades
  2. The error handling patterns are consistent:

    • Proper logging of errors with context
    • Appropriate cleanup on errors
    • Consistent error propagation
  3. The race condition handling appears robust:

    • State transitions are properly guarded
    • Cleanup happens in deferred functions
    • Status updates are atomic

Let me verify one more aspect of the implementation:


Based on the verification results, I can now provide a final assessment of the v2 data engine upgrade implementation:

Implementation verified as robust and complete

The code shows proper handling of instance state transitions and target instance management during v2 data engine upgrades:

  • Proper state transition handling between Running/Suspended/Stopped states
  • Atomic status updates with proper cleanup in deferred functions
  • Consistent error handling with appropriate logging
  • Clear separation between initiator and target instance lifecycle
  • Safe handling of target instance replacement with proper status tracking
  • Proper cleanup of target instance state after upgrades

The implementation correctly handles all edge cases around instance state transitions during upgrades, with appropriate status tracking and cleanup.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the v2 data engine upgrade implementation

# Check for any potential race conditions in upgrade handling
rg -A 5 "TargetInstanceReplacementCreated|CurrentTargetNodeID"

# Check for consistent error handling patterns
rg -A 3 "logrus\..*Error.*Failed to"

# Check for proper cleanup in error cases
ast-grep --pattern 'if err != nil {
  $$$
  return
}'

Length of output: 101320


Script:

#!/bin/bash
# Check for proper state transition handling
rg -A 5 "DesireState.*=.*Instance" controller/

Length of output: 35056

types/types.go (2)

43-44: LGTM: Constants follow naming conventions.

The new constants for data engine upgrade manager and node data engine upgrade follow the established naming patterns and are appropriately placed within their respective constant blocks.

Also applies to: 163-164


1271-1291: LGTM: Well-structured utility functions.

The new utility functions for data engine upgrade follow best practices:

  • Consistent with existing patterns for name generation and label management
  • Proper handling of optional parameters
  • Good use of base labels and component identification
controller/volume_controller_test.go (1)

506-510: ⚠️ Potential issue

Inconsistent engine status fields

The test is setting engine status fields that don't appear to be properly defined in the engine struct. This could lead to test failures and inconsistencies.

Please ensure these fields are properly defined in the engine struct before setting them in tests:

-		e.Status.TargetIP = ""
-		e.Status.StorageTargetIP = ""
-		e.Status.TargetPort = 0

The test should only set fields that are actually defined in the EngineStatus struct.

controller/node_controller.go (1)

2181-2189: Verify upgrade state transitions.

The scheduling logic for data engine upgrades looks correct, but we should verify the state transitions during upgrades to ensure proper handling of edge cases.

Let's verify the upgrade state transitions:

✅ Verification successful

Let me gather more information about the upgrade state handling to ensure comprehensive verification.


State transitions for data engine upgrades are properly implemented

The verification shows that the upgrade state transitions are well-handled:

  • The node controller correctly disables scheduling when upgrades are requested
  • The node upgrade monitor handles both setting and clearing of the upgrade flag
  • Error handling is in place for state transitions
  • The condition updates are properly integrated with the node status system
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential race conditions in upgrade state transitions

# Check for instances where DataEngineUpgradeRequested changes during active operations
rg -A 5 "DataEngineUpgradeRequested.*=.*true" 

# Check for any upgrade-related error handling
rg -A 5 "DataEngineUpgrade.*Error"

# Look for upgrade state transition patterns
ast-grep --pattern 'DataEngineUpgradeRequested = $value'

Length of output: 637


Script:

#!/bin/bash
# Check the complete node upgrade monitor implementation
rg -A 10 "func.*NodeUpgradeMonitor" 

# Look for state transition handling in node controller
rg -A 10 "DataEngineUpgradeRequested" controller/node_controller.go

# Check for upgrade completion handling
rg -A 5 "DataEngineUpgradeRequested.*=.*false"

# Look for upgrade status conditions
rg "NodeConditionReasonNodeDataEngineUpgradeRequested"

Length of output: 1890

k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (2)

981-1092: LGTM: DataEngineUpgradeManager deepcopy implementation

The implementation follows Kubernetes code generation patterns and correctly handles deep copying of all fields including maps and slices.

✅ Verification successful

Let me verify the spec and status types to ensure complete verification.


DeepCopy implementation correctly matches type definitions

The generated deepcopy code in zz_generated.deepcopy.go accurately implements deep copying for the DataEngineUpgradeManager type and its components:

  • Properly handles TypeMeta and ObjectMeta fields
  • Correctly deep copies the Spec's slice field Nodes
  • Accurately implements deep copying of Status's map field UpgradeNodes with proper nil checks for pointer values
  • All nested struct fields are properly deep copied
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the generated deepcopy code matches the type definition

# Test: Find the type definition and compare with generated code
ast-grep --pattern 'type DataEngineUpgradeManager struct {
  $$$
}'

Length of output: 749


Script:

#!/bin/bash
# Find DataEngineUpgradeManagerSpec and DataEngineUpgradeManagerStatus definitions
ast-grep --pattern 'type DataEngineUpgradeManagerSpec struct {
  $$$
}'

ast-grep --pattern 'type DataEngineUpgradeManagerStatus struct {
  $$$
}'

Length of output: 2404


1783-1890: LGTM: NodeDataEngineUpgrade deepcopy implementation

The implementation correctly handles deep copying of the Volumes map containing VolumeUpgradeStatus pointers with proper nil checks.

✅ Verification successful

Let me gather more information about the related types to ensure the deepcopy implementation is correct.


DeepCopy implementation correctly matches type definitions

The generated deepcopy code accurately handles all fields in the types:

  • NodeDataEngineUpgrade: Correctly deep copies TypeMeta, ObjectMeta, Spec, and Status
  • NodeDataEngineUpgradeSpec: Simple value types (strings) that don't need special deep copy handling
  • NodeDataEngineUpgradeStatus: Properly handles the map of pointers with nil checks for Volumes
  • VolumeUpgradeStatus: Simple value types that are correctly copied
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the generated deepcopy code matches the type definition

# Test: Find the type definition and compare with generated code
ast-grep --pattern 'type NodeDataEngineUpgrade struct {
  $$$
}'

Length of output: 716


Script:

#!/bin/bash
# Check NodeDataEngineUpgradeSpec and NodeDataEngineUpgradeStatus definitions
ast-grep --pattern 'type NodeDataEngineUpgradeSpec struct {
  $$$
}'

ast-grep --pattern 'type NodeDataEngineUpgradeStatus struct {
  $$$
}'

# Check VolumeUpgradeStatus definition since it's used in the Status
ast-grep --pattern 'type VolumeUpgradeStatus struct {
  $$$
}'

Length of output: 3087

controller/engine_controller.go (6)

469-497: LGTM! Well-structured changes with improved logging

The changes enhance the CreateInstance method with better instance manager handling and more informative logging. The integration with findInstanceManagerAndIPs is clean and effective.


524-527: LGTM! Port handling is correct

Based on previous learnings, the port value from e.Status.Port is sourced from the SPDK engine and doesn't require additional validation. The address construction using JoinHostPort is appropriate.


634-642: LGTM! Error handling has been fixed

The error handling for target deletion was previously improved based on past review comments.


646-673: LGTM! Well-implemented instance suspension

The method properly validates data engine type and parameters before proceeding with suspension. Error handling is comprehensive.


Line range hint 2545-2642: LGTM! Robust responsibility checking implementation

The method effectively handles various scenarios including delinquent nodes and data engine availability. The logic for determining responsibility is clear and well-structured.


Line range hint 2419-2476: LGTM with verification suggestion

The upgrade handling is well-implemented with proper validation for both data engine types. Consider adding verification for the upgrade completion.

✅ Verification successful

LGTM - Upgrade handling implementation is properly validated

The verification confirms that:

  • The upgrade process includes proper state validation for instance managers, with consistent error handling across the codebase
  • The upgrade completion is properly logged and followed by appropriate status field updates
  • Instance manager state validation is consistently implemented across different components (engine controller, disk service, monitoring)

The implementation shows robust error handling and proper cleanup, making it safe for production use.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify instance manager readiness after upgrade
# Test: Check if both initiator and target instance managers are ready after upgrade

rg -A 5 "instance manager.*not running" . | grep "error initiating.*upgrade"

Length of output: 78


Script:

#!/bin/bash
# Let's verify the upgrade handling implementation more comprehensively

# Check for upgrade-related error handling patterns
rg -A 5 "Engine has been upgraded from" .

# Check for instance manager validation patterns
rg "instance manager.*not running" .

# Look for upgrade-related functions and their error handling
ast-grep --pattern 'if im.Status.CurrentState != longhorn.InstanceManagerStateRunning {
  $$$
}'

Length of output: 2009

k8s/crds.yaml (1)

1791-1792: Verify consistent usage of targetNodeID fields

The addition of targetNodeID fields in Engine and Replica CRDs needs verification to ensure consistent behavior during upgrades.

Consider documenting the following aspects:

  1. When and how targetNodeID is set during upgrades
  2. How conflicts between Engine and Replica target nodes are resolved
  3. The relationship between these fields and the upgrade process

Also applies to: 1882-1883, 3243-3244, 3286-3287

✅ Verification successful

Let me gather more information about the relationship between these fields and the upgrade process.

The additional context will help verify the upgrade and migration behavior implementation.


Consistent implementation of targetNodeID fields verified

The code shows a well-implemented and consistent usage of targetNodeID fields across the codebase:

  1. Proper validation in webhook prevents invalid targetNodeID changes:

    • Blocks changes when volume is unhealthy
    • Requires multiple replicas for migration
    • Prevents setting same node as target
    • Only allows changes during v2 data engine upgrades
  2. Robust state management in controllers:

    • Tracks current vs desired target node state
    • Handles instance replacement and suspension
    • Coordinates migration with engine state changes
    • Cleans up old target instances
  3. Clear upgrade workflow:

    • Validates node readiness for v2 engine upgrades
    • Manages transition between nodes
    • Handles rollback scenarios
    • Coordinates with instance manager replacement
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent usage of targetNodeID fields across controllers
# Test: Search for targetNodeID field usage in controllers
rg -A 5 "targetNodeID" 

Length of output: 6165


Script:

#!/bin/bash
# Check for upgrade-related code that uses targetNodeID
rg -A 10 "v2.*upgrade.*target" 

# Check for CurrentTargetNodeID usage
rg -A 5 "CurrentTargetNodeID"

# Look for volume migration logic
ast-grep --pattern 'func $_($$) {
  $$$
  targetNodeID $$$
  $$$
}'

Length of output: 12688

controller/volume_controller.go (2)

1007-1012: LGTM: Special handling for v2 data engine replica images

The code correctly handles the case where v2 data engine replica images can differ from the volume image, while maintaining backward compatibility for v1 data engine.


5086-5094: LGTM: Comprehensive upgrade status check

The implementation correctly handles upgrade status checking for both v1 and v2 data engines, considering both image updates and target node changes for v2.

Comment on lines +172 to +176
}

func (uc *DataEngineUpgradeManagerController) reconcile(upgradeManagerName string) (err error) {
    upgradeManager, err := uc.ds.GetDataEngineUpgradeManager(upgradeManagerName)
    if err != nil {
@coderabbitai coderabbitai bot Nov 25, 2024

⚠️ Potential issue

Handle error returned by uc.reconcile(name)

In syncDataEngineUpgradeManager, the error returned by uc.reconcile(name) is not being handled properly, which may cause silent failures.

Apply this diff to handle the error:

 func (uc *DataEngineUpgradeManagerController) syncDataEngineUpgradeManager(key string) (err error) {
     // ...
-    return uc.reconcile(name)
+    if err := uc.reconcile(name); err != nil {
+        return err
+    }
+    return nil
 }

Committable suggestion skipped: line range outside the PR's diff.

derekbit (Member Author)

What's the difference?


@derekbit, you're correct—the difference is minimal. In both versions of the code:

func (uc *DataEngineUpgradeManagerController) syncDataEngineUpgradeManager(key string) (err error) {
    // ...
    return uc.reconcile(name)
}

and

func (uc *DataEngineUpgradeManagerController) syncDataEngineUpgradeManager(key string) (err error) {
    // ...
    if err := uc.reconcile(name); err != nil {
        return err
    }
    return nil
}

the function ultimately returns the error from uc.reconcile(name). The suggested change adds an explicit error check, but since uc.reconcile(name) is already being returned directly, the behavior remains the same. Apologies for any confusion—the original code handles the error appropriately, and no change is needed here.

Comment on lines +135 to +142
if uc.queue.NumRequeues(key) < maxRetries {
    handleReconcileErrorLogging(log, err, "Failed to sync Longhorn DataEngineUpgradeManager resource")
    uc.queue.AddRateLimited(key)
    return
}

utilruntime.HandleError(err)
handleReconcileErrorLogging(log, err, "Dropping Longhorn dataEngineUpgradeManager out of the queue")

⚠️ Potential issue

Undefined variable maxRetries and function handleReconcileErrorLogging

The variable maxRetries and the function handleReconcileErrorLogging are used but not defined or imported in the file. This will lead to a compile-time error.

Apply this diff to define maxRetries and implement handleReconcileErrorLogging:

+const maxRetries = 5

+func handleReconcileErrorLogging(log *logrus.Entry, err error, msg string) {
+    log.WithError(err).Warn(msg)
+}

 func (uc *DataEngineUpgradeManagerController) handleErr(err error, key interface{}) {
     if err == nil {
         uc.queue.Forget(key)
         return
     }

     log := uc.logger.WithField("dataEngineUpgradeManager", key)
     if uc.queue.NumRequeues(key) < maxRetries {
         handleReconcileErrorLogging(log, err, "Failed to sync Longhorn DataEngineUpgradeManager resource")
         uc.queue.AddRateLimited(key)
         return
     }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +388 to +391
if oldVolume.Spec.Image == instanceManagerImage {
    return werror.NewInvalidError(fmt.Sprintf("volume %v is already using instance manager image %v", newVolume.Name, instanceManagerImage), "")
}

⚠️ Potential issue

Potential logic error when checking instance manager image

The check may incorrectly prevent upgrading when the volume is already using the desired instance manager image. This could block valid operations.

Consider adjusting the logic to allow proceeding if the volume is already using the instanceManagerImage.

Apply this diff to correct the condition:

-if oldVolume.Spec.Image == instanceManagerImage {
-    return werror.NewInvalidError(fmt.Sprintf("volume %v is already using instance manager image %v", newVolume.Name, instanceManagerImage), "")
+if oldVolume.Spec.Image != instanceManagerImage {
+    return werror.NewInvalidError(fmt.Sprintf("volume %v is not using instance manager image %v", newVolume.Name, instanceManagerImage), "")
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-if oldVolume.Spec.Image == instanceManagerImage {
-    return werror.NewInvalidError(fmt.Sprintf("volume %v is already using instance manager image %v", newVolume.Name, instanceManagerImage), "")
-}
+if oldVolume.Spec.Image != instanceManagerImage {
+    return werror.NewInvalidError(fmt.Sprintf("volume %v is not using instance manager image %v", newVolume.Name, instanceManagerImage), "")
+}

Comment on lines +231 to +233
if _, err := uc.createNodeDataEngineUpgradeMonitor(nodeUpgrade); err != nil {
    return err
}

🛠️ Refactor suggestion

Handle monitor creation errors more gracefully

The error from createNodeDataEngineUpgradeMonitor is returned directly, which could lead to unnecessary retries. Consider:

  1. Logging the specific error
  2. Setting an error state in the status
  3. Determining if the error is recoverable
 if _, err := uc.createNodeDataEngineUpgradeMonitor(nodeUpgrade); err != nil {
+    log.WithError(err).Error("Failed to create node data engine upgrade monitor")
+    nodeUpgrade.Status.State = longhorn.UpgradeStateError
+    nodeUpgrade.Status.Message = fmt.Sprintf("Failed to create monitor: %v", err)
     return err
 }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +214 to +240
if h.isEngineOfV2DataEngine(obj, spec.DataEngine) {
    if h.isV2DataEngineBeingUpgraded(spec, status) {
        logrus.Warnf("Skipping the instance %v since the instance manager %v is %v", instanceName, im.Name, im.Status.CurrentState)
        return
    }

    if spec.Image == status.CurrentImage {
        if status.CurrentState != longhorn.InstanceStateError {
            upgradeRequested, err := h.ds.IsNodeDataEngineUpgradeRequested(spec.NodeID)
            if err != nil {
                // TODO: should we return here or mark the instance as error?
                logrus.WithError(err).Errorf("Failed to check if node %v is being upgrade requested", spec.NodeID)
                return
            }
            if upgradeRequested {
                logrus.Warnf("Skipping the instance %v since the instance manager %v is %v since the node %v is being upgrade requested",
                    instanceName, im.Name, im.Status.CurrentState, spec.NodeID)
                return
            }
            logrus.Warnf("Marking the instance as state ERROR since failed to find the instance manager for the running instance %v", instanceName)
        }
        status.CurrentState = longhorn.InstanceStateError
    } else {
        logrus.Warnf("Skipping the instance %v since the instance manager %v is %v and spec image %v is different from the current image %v",
            instanceName, im.Name, im.Status.CurrentState, spec.Image, status.CurrentImage)
        return
    }

⚠️ Potential issue

Resolve error handling uncertainty

The TODO comment indicates uncertainty about error handling when checking node upgrade status. This should be resolved to ensure consistent behavior.

Apply this diff:

-            // TODO: should we return here or mark the instance as error?
-            logrus.WithError(err).Errorf("Failed to check if node %v is being upgrade requested", spec.NodeID)
-            return
+            logrus.WithError(err).Errorf("Failed to check if node %v is being upgrade requested", spec.NodeID)
+            status.CurrentState = longhorn.InstanceStateError
+            return fmt.Errorf("failed to check node upgrade status: %v", err)

Committable suggestion skipped: line range outside the PR's diff.

types/types.go: comment outdated (resolved)
k8s/crds.yaml: comment resolved
@derekbit derekbit force-pushed the v2-control-upgrade branch 2 times, most recently from cd56fa2 to 49625f4 on November 26, 2024 00:09
@coderabbitai coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 10

🧹 Outside diff range and nitpick comments (26)
webhook/resources/nodedataengineupgrade/validator.go (1)

46-63: Refactor validation logic to reduce code duplication

The Create and Update methods contain similar validation logic for the NodeDataEngineUpgrade object's fields. Extracting this common validation into a separate helper function can improve maintainability and readability.

Consider refactoring the code as follows:

+func validateNodeDataEngineUpgradeFields(nodeUpgrade *longhorn.NodeDataEngineUpgrade) error {
+    if nodeUpgrade.Spec.NodeID == "" {
+        return werror.NewInvalidError("nodeID is required", "spec.nodeID")
+    }
+    if nodeUpgrade.Spec.DataEngine != longhorn.DataEngineTypeV2 {
+        return werror.NewInvalidError(fmt.Sprintf("data engine %v is not supported", nodeUpgrade.Spec.DataEngine), "spec.dataEngine")
+    }
+    if nodeUpgrade.Spec.InstanceManagerImage == "" {
+        return werror.NewInvalidError("instanceManagerImage is required", "spec.instanceManagerImage")
+    }
+    if nodeUpgrade.Spec.DataEngineUpgradeManager == "" {
+        return werror.NewInvalidError("dataEngineUpgradeManager is required", "spec.dataEngineUpgradeManager")
+    }
+    return nil
+}
+
 func (u *nodeDataEngineUpgradeValidator) Create(request *admission.Request, newObj runtime.Object) error {
     nodeUpgrade, ok := newObj.(*longhorn.NodeDataEngineUpgrade)
     if !ok {
         return werror.NewInvalidError("object is not a *longhorn.NodeDataEngineUpgrade", "")
     }
-    if nodeUpgrade.Spec.NodeID == "" {
-        return werror.NewInvalidError("nodeID is required", "spec.nodeID")
-    }
-    if nodeUpgrade.Spec.DataEngine != longhorn.DataEngineTypeV2 {
-        err := fmt.Errorf("data engine %v is not supported", nodeUpgrade.Spec.DataEngine)
-        return werror.NewInvalidError(err.Error(), "spec.dataEngine")
-    }
-    if nodeUpgrade.Spec.InstanceManagerImage == "" {
-        err := fmt.Errorf("instanceManagerImage is required")
-        return werror.NewInvalidError(err.Error(), "spec.instanceManagerImage")
-    }
-    if nodeUpgrade.Spec.DataEngineUpgradeManager == "" {
-        err := fmt.Errorf("dataEngineUpgradeManager is required")
-        return werror.NewInvalidError(err.Error(), "spec.dataEngineUpgradeManager")
-    }
-    return nil
+    return validateNodeDataEngineUpgradeFields(nodeUpgrade)
 }

 func (u *nodeDataEngineUpgradeValidator) Update(request *admission.Request, oldObj runtime.Object, newObj runtime.Object) error {
     oldNodeUpgrade, ok := oldObj.(*longhorn.NodeDataEngineUpgrade)
     if !ok {
         return werror.NewInvalidError("old object is not a *longhorn.NodeDataEngineUpgrade", "")
     }
     newNodeUpgrade, ok := newObj.(*longhorn.NodeDataEngineUpgrade)
     if !ok {
         return werror.NewInvalidError("new object is not a *longhorn.NodeDataEngineUpgrade", "")
     }
+    if err := validateNodeDataEngineUpgradeFields(newNodeUpgrade); err != nil {
+        return err
+    }
     if oldNodeUpgrade.Spec.NodeID != newNodeUpgrade.Spec.NodeID {
         return werror.NewInvalidError("nodeID field is immutable", "spec.nodeID")
     }
     if oldNodeUpgrade.Spec.DataEngine != newNodeUpgrade.Spec.DataEngine {
         return werror.NewInvalidError("dataEngine field is immutable", "spec.dataEngine")
     }
     if oldNodeUpgrade.Spec.InstanceManagerImage != newNodeUpgrade.Spec.InstanceManagerImage {
         return werror.NewInvalidError("instanceManagerImage field is immutable", "spec.instanceManagerImage")
     }
     if oldNodeUpgrade.Spec.DataEngineUpgradeManager != newNodeUpgrade.Spec.DataEngineUpgradeManager {
         return werror.NewInvalidError("dataEngineUpgradeManager field is immutable", "spec.dataEngineUpgradeManager")
     }
     return nil
 }

Also applies to: 78-93

controller/monitor/upgrade_manager_monitor.go (4)

26-26: Simplify constant name by removing redundant 'Monitor'

The constant DataEngineUpgradeMonitorMonitorSyncPeriod contains 'Monitor' twice. Consider renaming it to DataEngineUpgradeManagerMonitorSyncPeriod or DataEngineUpgradeMonitorSyncPeriod for clarity.

Apply this diff:

-	DataEngineUpgradeMonitorMonitorSyncPeriod = 3 * time.Second
+	DataEngineUpgradeManagerMonitorSyncPeriod = 3 * time.Second

And update its usage accordingly.


61-64: Clarify error message in monitoring loop

The error message "Stopped monitoring upgrade monitor" might be misleading, as the monitor continues to run even if m.run() returns an error. Consider rephrasing the error message to accurately reflect the situation.

Suggested change:

-	m.logger.WithError(err).Error("Stopped monitoring upgrade monitor")
+	m.logger.WithError(err).Error("Error occurred during upgrade manager monitor run")

210-285: Refactor duplicated code in handleUpgradeStateInitializing

The logic for handling nodes is duplicated in both branches of the if len(upgradeManager.Spec.Nodes) == 0 condition. Consider refactoring this code into a separate function to avoid duplication and improve maintainability.
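
A minimal sketch of the extraction, assuming the per-node initialization is the duplicated part; the helper name and the nodeUpgradeStatus shape below are placeholders for illustration, not the actual Longhorn types:

// nodeUpgradeStatus is a stand-in for the real per-node status struct.
type nodeUpgradeStatus struct {
	State   string
	Message string
}

func buildInitialNodeStatuses(nodeNames []string) map[string]*nodeUpgradeStatus {
	statuses := make(map[string]*nodeUpgradeStatus, len(nodeNames))
	for _, name := range nodeNames {
		statuses[name] = &nodeUpgradeStatus{State: "pending"}
	}
	return statuses
}

Both branches would then only differ in how they collect nodeNames: from upgradeManager.Spec.Nodes when it is non-empty, or from listing all available nodes when it is empty.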


331-331: Address the TODO in handleUpgradeStateUpgrading

There is a TODO comment to check for any NodeDataEngineUpgrade in progress but not tracked by m.upgradeManagerStatus.UpgradingNode. Completing this will ensure that the monitor accurately reflects the upgrade status of all nodes.

Would you like assistance in implementing this check?
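
One possible shape for that check, kept independent of the datastore by passing the listed upgrades in; the field accesses mirror the NodeDataEngineUpgrade spec/status introduced in this PR, but the helper itself is only a sketch:

// findUntrackedUpgradingNode returns the name of any NodeDataEngineUpgrade that is
// neither completed nor failed and is not the node currently tracked by the monitor.
func findUntrackedUpgradingNode(nodeUpgrades map[string]*longhorn.NodeDataEngineUpgrade, trackedNode string) string {
	for name, upgrade := range nodeUpgrades {
		if upgrade.Status.State == longhorn.UpgradeStateCompleted ||
			upgrade.Status.State == longhorn.UpgradeStateError {
			continue
		}
		if upgrade.Spec.NodeID != trackedNode {
			return name
		}
	}
	return ""
}

handleUpgradeStateUpgrading could treat a non-empty result as "another upgrade is already in progress" and hold off before scheduling the next node.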

webhook/resources/nodedataengineupgrade/mutator.go (2)

43-49: Consider adding pre-mutation validation

While the nil check is good, consider adding pre-mutation validation to ensure the NodeDataEngineUpgrade object meets basic requirements before mutation.

 func (u *nodeDataEngineUpgradeMutator) Create(request *admission.Request, newObj runtime.Object) (admission.PatchOps, error) {
 	if newObj == nil {
 		return nil, werror.NewInvalidError("newObj is nil", "")
 	}
 
+	if nodeUpgrade, ok := newObj.(*longhorn.NodeDataEngineUpgrade); !ok {
+		return nil, werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.NodeDataEngineUpgrade", newObj), "")
+	} else if nodeUpgrade.Spec.NodeID == "" {
+		return nil, werror.NewInvalidError("nodeID is required", "")
+	} else if nodeUpgrade.Spec.DataEngineUpgradeManager == "" {
+		return nil, werror.NewInvalidError("dataEngineUpgradeManager is required", "")
+	}
+
 	return mutate(newObj)
 }

51-78: Enhance error context in mutate function

While error handling is present, consider enhancing error context by including more details about the failing operation.

 func mutate(newObj runtime.Object) (admission.PatchOps, error) {
 	nodeUpgrade, ok := newObj.(*longhorn.NodeDataEngineUpgrade)
 	if !ok {
 		return nil, werror.NewInvalidError(fmt.Sprintf("%v is not a *longhorn.NodeDataEngineUpgrade", newObj), "")
 	}
 	var patchOps admission.PatchOps
 
 	longhornLabels := types.GetNodeDataEngineUpgradeLabels(nodeUpgrade.Spec.DataEngineUpgradeManager, nodeUpgrade.Spec.NodeID)
 	patchOp, err := common.GetLonghornLabelsPatchOp(nodeUpgrade, longhornLabels, nil)
 	if err != nil {
-		err := errors.Wrapf(err, "failed to get label patch for nodeUpgrade %v", nodeUpgrade.Name)
+		err := errors.Wrapf(err, "failed to get label patch for nodeUpgrade %v with labels %v", nodeUpgrade.Name, longhornLabels)
 		return nil, werror.NewInvalidError(err.Error(), "")
 	}
 	if patchOp != "" {
 		patchOps = append(patchOps, patchOp)
 	}
 
 	patchOp, err = common.GetLonghornFinalizerPatchOpIfNeeded(nodeUpgrade)
 	if err != nil {
-		err := errors.Wrapf(err, "failed to get finalizer patch for nodeDataEngineUpgrade %v", nodeUpgrade.Name)
+		err := errors.Wrapf(err, "failed to get finalizer patch for nodeDataEngineUpgrade %v in namespace %v", nodeUpgrade.Name, nodeUpgrade.Namespace)
 		return nil, werror.NewInvalidError(err.Error(), "")
 	}
 	if patchOp != "" {
 		patchOps = append(patchOps, patchOp)
 	}
 
 	return patchOps, nil
 }
controller/upgrade_manager_controller.go (2)

57-57: Track or address the TODO comment

The comment indicates potential technical debt regarding the event sink wrapper. Consider creating a tracking issue to remove this wrapper once all clients have migrated to use the clientset.

Would you like me to create a GitHub issue to track this technical debt?


262-275: Consider additional error cases in monitor creation

While the implementation is correct, consider adding cleanup logic when monitor creation fails to ensure resources are properly released.

Apply this diff to improve error handling:

     monitor, err := monitor.NewDataEngineUpgradeManagerMonitor(uc.logger, uc.ds, upgradeManager.Name, upgradeManager.Status.OwnerID, uc.enqueueDataEngineUpgradeManagerForMonitor)
     if err != nil {
+        // Ensure any partially initialized resources are cleaned up
+        if monitor != nil {
+            monitor.Close()
+        }
         return nil, err
     }
controller/node_upgrade_controller.go (1)

57-57: Address or remove the TODO comment

The comment suggests there's technical debt related to client wrapper usage. Consider creating a tracking issue for this if it's still relevant.

Would you like me to create a GitHub issue to track the removal of the client wrapper?

controller/monitor/node_upgrade_monitor.go (3)

74-78: Consider improving error handling in the polling function.

The error from m.run() is logged but not propagated. This could mask important errors and make debugging harder.

Consider this approach:

-if err := m.run(struct{}{}); err != nil {
-    m.logger.WithError(err).Error("Stopped monitoring upgrade monitor")
-}
-return false, nil
+if err := m.run(struct{}{}); err != nil {
+    m.logger.WithError(err).Error("Error in upgrade monitor cycle")
+    // Allow the monitor to continue but with a backoff
+    return false, nil
+}
+return false, nil

143-162: Add state transition validation.

The state machine could benefit from validation of state transitions to prevent invalid state changes.

Consider adding a validation function:

func (m *NodeDataEngineUpgradeMonitor) validateStateTransition(from, to longhorn.UpgradeState) error {
    validTransitions := map[longhorn.UpgradeState][]longhorn.UpgradeState{
        longhorn.UpgradeStateUndefined: {longhorn.UpgradeStateInitializing},
        longhorn.UpgradeStateInitializing: {longhorn.UpgradeStateSwitchingOver, longhorn.UpgradeStateError},
        // Add other valid transitions
    }
    
    if valid, exists := validTransitions[from]; exists {
        for _, validState := range valid {
            if validState == to {
                return nil
            }
        }
    }
    return fmt.Errorf("invalid state transition from %s to %s", from, to)
}

27-912: Consider adding metrics for monitoring upgrade progress.

The upgrade process involves multiple states and operations. Adding metrics would help monitor the upgrade process and diagnose issues in production.

Consider adding metrics for:

  • State transition durations
  • Volume operation success/failure rates
  • Resource usage during upgrades
  • Number of retries for operations

This would improve observability and help with troubleshooting upgrade issues.
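
A sketch of what such instrumentation could look like with prometheus/client_golang; the metric names and labels are invented for illustration and are not existing Longhorn metrics:

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Time spent in each node data engine upgrade state, labeled by node and state.
	upgradeStateDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "longhorn_node_data_engine_upgrade_state_duration_seconds",
			Help: "Time spent in each node data engine upgrade state.",
		},
		[]string{"node", "state"},
	)
	// Volume operations that failed during an upgrade, labeled by node and operation.
	upgradeVolumeFailures = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "longhorn_node_data_engine_upgrade_volume_failures_total",
			Help: "Volume operations that failed during a node data engine upgrade.",
		},
		[]string{"node", "operation"},
	)
)

func init() {
	prometheus.MustRegister(upgradeStateDuration, upgradeVolumeFailures)
}

// Example usage when leaving a state:
//   upgradeStateDuration.WithLabelValues(nodeName, string(oldState)).Observe(time.Since(enteredAt).Seconds())

The monitor would record enteredAt whenever it transitions m.nodeUpgradeStatus.State and observe the elapsed time on the next transition.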

engineapi/instance_manager.go (1)

532-555: Add documentation explaining the replica filtering logic

While the implementation is correct, adding a comment explaining why replicas matching the initiator IP are skipped when initiator and target IPs differ would improve code maintainability.

 func getReplicaAddresses(replicaAddresses map[string]string, initiatorAddress, targetAddress string) (map[string]string, error) {
+	// Skip replicas that match the initiator IP when initiator and target IPs are different.
+	// This prevents data path loops during live upgrades when the initiator node also hosts a replica.
controller/instance_handler.go (4)

58-165: Consider breaking down the syncStatusIPsAndPorts function for better maintainability.

The function is handling multiple responsibilities including IP sync, port sync, and target instance management. Consider splitting it into smaller, focused functions:

  • syncInstanceIPs
  • syncInstancePorts
  • syncTargetInstanceStatus

This would improve readability and make the code easier to maintain.
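
A possible skeleton for that split; the helper names are suggestions and the bodies would simply receive the code currently inlined in syncStatusIPsAndPorts:

func (h *InstanceHandler) syncStatusIPsAndPorts(im *longhorn.InstanceManager, spec *longhorn.InstanceSpec,
	status *longhorn.InstanceStatus, instanceName string, instance longhorn.InstanceProcess) {
	h.syncInstanceIPs(im, status)
	h.syncInstancePorts(instance, status)
	h.syncTargetInstanceStatus(spec, status, instanceName)
}

func (h *InstanceHandler) syncInstanceIPs(im *longhorn.InstanceManager, status *longhorn.InstanceStatus) {
	// Move the existing Status.IP / Status.StorageIP handling here unchanged.
}

func (h *InstanceHandler) syncInstancePorts(instance longhorn.InstanceProcess, status *longhorn.InstanceStatus) {
	// Move the existing Status.Port / Status.TargetPort handling here unchanged.
}

func (h *InstanceHandler) syncTargetInstanceStatus(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus, instanceName string) {
	// Move the target-instance (TargetIP / StorageTargetIP) handling here unchanged,
	// including the early return when spec.TargetNodeID is empty.
}

The parameter lists above are illustrative; in practice each helper should take exactly what the extracted code needs and nothing more.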


526-533: Enhance error messages for v2 data engine upgrade scenarios.

The conditional block for v2 data engine upgrade could benefit from more descriptive error messages to aid in troubleshooting. Consider adding context about the current state and expected state in the log messages.


861-881: Add unit tests for isV2DataEngineBeingUpgraded.

This function contains critical logic for determining upgrade state but lacks corresponding unit tests. Consider adding test cases covering:

  1. Non-v2 data engine scenarios
  2. Node upgrade request scenarios
  3. Target node configuration scenarios

Would you like me to help generate the unit test cases?
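
Testing the full function requires faking the datastore call behind IsNodeDataEngineUpgradeRequested, but the pure decision at the end can be table-tested on its own. A sketch, assuming that condition is factored into a small helper (hypothetical name):

import "testing"

func isUpgradeTargetSwitch(nodeID, targetNodeID, currentTargetNodeID string) bool {
	if targetNodeID == "" {
		return false
	}
	return nodeID != targetNodeID && targetNodeID == currentTargetNodeID
}

func TestIsUpgradeTargetSwitch(t *testing.T) {
	cases := []struct {
		name                string
		nodeID              string
		targetNodeID        string
		currentTargetNodeID string
		want                bool
	}{
		{"no target node configured", "node-1", "", "", false},
		{"target equals current node", "node-1", "node-1", "node-1", false},
		{"target set but not yet current in status", "node-1", "node-2", "", false},
		{"target differs and matches status", "node-1", "node-2", "node-2", true},
	}
	for _, tc := range cases {
		if got := isUpgradeTargetSwitch(tc.nodeID, tc.targetNodeID, tc.currentTargetNodeID); got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}

The non-v2 and upgrade-not-requested cases would then be covered separately against a fake datastore.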


Line range hint 38-995: Consider documenting the v2 data engine upgrade workflow.

The implementation introduces complex state management for v2 data engine upgrades. Consider:

  1. Adding high-level documentation explaining the upgrade workflow.
  2. Creating sequence diagrams showing the state transitions.
  3. Documenting failure scenarios and recovery procedures.

This would help future maintainers understand the system behavior during upgrades.

controller/volume_controller_test.go (1)

506-510: LGTM! Consider adding field documentation.

The new engine status fields for network configuration are properly handled during cleanup. Consider adding documentation comments to describe the purpose and relationship between these fields:

  • TargetIP
  • StorageTargetIP
  • TargetPort
controller/node_controller.go (1)

Line range hint 2177-2209: LGTM! Consider enhancing the upgrade message.

The implementation correctly handles node scheduling during data engine upgrades. The code is well-structured and maintains consistency with existing patterns.

Consider making the upgrade message more specific by including the target data engine version:

-		message = fmt.Sprintf("Data engine of node %v is being upgraded", node.Name)
+		message = fmt.Sprintf("Data engine of node %v is being upgraded to v2", node.Name)
controller/engine_controller.go (2)

437-467: Consider validating IP addresses from instance manager status

The method correctly retrieves IPs from instance managers, but should validate that the IPs are not empty or malformed before returning them.

 func (ec *EngineController) findInstanceManagerAndIPs(obj interface{}) (im *longhorn.InstanceManager, initiatorIP string, targetIP string, err error) {
     // ...
     initiatorIP = initiatorIM.Status.IP
+    if initiatorIP == "" {
+        return nil, "", "", fmt.Errorf("invalid empty IP from initiator instance manager %v", initiatorIM.Name)
+    }
     targetIP = initiatorIM.Status.IP
+    if targetIP == "" {
+        return nil, "", "", fmt.Errorf("invalid empty IP from target instance manager %v", initiatorIM.Name)
+    }
     // ...
 }

839-883: Consider simplifying nested conditions in GetInstance

The method has become complex with multiple nested conditions. Consider extracting the instance manager retrieval logic into a separate method.

+func (ec *EngineController) getInstanceManager(e *longhorn.Engine, nodeID string, instanceManagerName string) (*longhorn.InstanceManager, error) {
+    if instanceManagerName != "" {
+        im, err := ec.ds.GetInstanceManagerRO(instanceManagerName)
+        if err != nil && !apierrors.IsNotFound(err) {
+            return nil, err
+        }
+        if im != nil {
+            return im, nil
+        }
+    }
+    
+    if types.IsDataEngineV2(e.Spec.DataEngine) {
+        return ec.ds.GetRunningInstanceManagerByNodeRO(nodeID, e.Spec.DataEngine)
+    }
+    
+    return nil, fmt.Errorf("instance manager not found")
+}

 func (ec *EngineController) GetInstance(obj interface{}, isInstanceOnRemoteNode bool) (*longhorn.InstanceProcess, error) {
     // ... existing validation code ...
     
-    if instanceManagerName == "" {
-        if e.Spec.DesireState == longhorn.InstanceStateRunning && e.Status.CurrentState == longhorn.InstanceStateSuspended {
-            im, err = ec.ds.GetRunningInstanceManagerByNodeRO(nodeID, e.Spec.DataEngine)
-        } else {
-            im, err = ec.ds.GetInstanceManagerByInstanceRO(obj, false)
-        }
-        // ... error handling
-    } else if im == nil {
-        im, err = ec.ds.GetInstanceManagerRO(instanceManagerName)
-        // ... complex error handling
-    }
+    im, err = ec.getInstanceManager(e, nodeID, instanceManagerName)
+    if err != nil {
+        return nil, err
+    }
     
     // ... rest of the method
 }
controller/volume_controller.go (4)

1007-1012: Improve v2 engine image handling documentation

The code correctly handles v2 engine image differences but would benefit from a comment explaining why v2 volumes allow replicas to have different images from the volume.

Add a comment like:

+ // For v2 volumes, replica images can differ from volume image since they use the instance manager image
  if types.IsDataEngineV1(v.Spec.DataEngine) {
    // r.Spec.Active shouldn't be set for the leftover replicas, something must wrong
    log.WithField("replica", r.Name).Warnf("Replica engine image %v is different from volume engine image %v, "+
      "but replica spec.Active has been set", r.Spec.Image, v.Spec.Image)
  }

1873-1877: Improve v2 replica image check documentation

The code correctly skips image check for v2 replicas but would benefit from additional context.

Add a comment explaining why v2 replicas don't need image matching:

+ // For v2 volumes, replicas use the instance manager image rather than the volume image
  if types.IsDataEngineV1(v.Spec.DataEngine) {
    if r.Spec.Image != v.Status.CurrentImage {
      continue
    }
  }

3257-3281: Improve detached volume upgrade handling

The detached volume upgrade handling looks good but would benefit from additional logging.

Add debug logging to track the upgrade progress:

  if e.Spec.Image != v.Spec.Image {
+   log.Debugf("Updating engine image from %v to %v during detached upgrade", e.Spec.Image, v.Spec.Image)
    e.Spec.Image = v.Spec.Image
    e.Spec.UpgradedReplicaAddressMap = map[string]string{}
    e.Spec.TargetNodeID = ""
  }

3322-3327: Improve v2 live upgrade completion check documentation

The completion check logic is correct but would benefit from documentation.

Add a comment explaining the completion criteria:

+ // A v2 data engine live upgrade is complete when:
+ // 1. Target node is specified
+ // 2. Volume image matches current image
+ // 3. Target node matches current target node
+ // 4. Node ID matches current node ID
  func isV2DataEngineLiveUpgradeCompleted(v *longhorn.Volume) bool {
    return v.Spec.TargetNodeID != "" &&
      v.Spec.Image == v.Status.CurrentImage &&
      v.Spec.TargetNodeID == v.Status.CurrentTargetNodeID &&
      v.Spec.NodeID == v.Status.CurrentNodeID
  }
🛑 Comments failed to post (10)
controller/monitor/upgrade_manager_monitor.go (1)

182-184: ⚠️ Potential issue

Correct error variable usage in error handling

In checkInstanceManagerImageReady, the error from m.ds.GetKubernetesNodeRO(nodeName) is assigned to errGet, but the subsequent check incorrectly uses if err != nil instead of if errGet != nil. As a result, a failed node lookup is never detected, and an err already accumulated from a previous iteration can trigger the branch spuriously.

Apply this diff to fix the issue:

     kubeNode, errGet := m.ds.GetKubernetesNodeRO(nodeName)
-    if err != nil {
+    if errGet != nil {
         err = multierr.Append(err, errors.Wrapf(errGet, "failed to get kubeNode %v", nodeName))
         continue
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

		kubeNode, errGet := m.ds.GetKubernetesNodeRO(nodeName)
		if errGet != nil {
			err = multierr.Append(err, errors.Wrapf(errGet, "failed to get kubeNode %v", nodeName))
controller/instance_handler_test.go (1)

100-102: 💡 Codebase verification

Based on the gathered evidence, I can now provide the final response:

The mock implementation requires test coverage for remote target instance scenarios

The RequireRemoteTargetInstance method in EngineController has a meaningful implementation that checks if TargetNodeID is set and different from NodeID, but the mock in MockInstanceManagerHandler simply returns false without testing this logic. This needs test coverage because:

  • The actual implementation in engine_controller.go handles important node targeting logic
  • The functionality is used in volume management and node upgrades as seen in multiple controllers
  • The mock's behavior could hide bugs in tests that depend on proper remote target instance detection

Suggested test scenarios:

  • When TargetNodeID is empty
  • When TargetNodeID equals NodeID
  • When TargetNodeID differs from NodeID
🔗 Analysis chain

Consider adding test coverage for remote target instance scenarios.

The RequireRemoteTargetInstance method always returns false without any test coverage. Consider adding test cases to verify the behavior when remote target instances are required.


Let me search for more context about this interface and its implementation to better understand the testing requirements.


Let me check for any existing tests in the engine controller that might be testing this functionality, since that's where the main implementation exists.


Let me check one final thing to see if there are any tests that verify the remote target instance behavior in the engine controller tests.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test cases involving remote target instances
rg -B 5 -A 5 "RequireRemoteTargetInstance" --type go --glob "*_test.go"

Length of output: 996


Script:

#!/bin/bash
# Search for the interface definition and actual implementation
ast-grep --pattern 'type $_InstanceManagerHandler interface {
  $$$
  RequireRemoteTargetInstance($_) (bool, error)
  $$$
}'

# Search for actual implementation
ast-grep --pattern 'func ($_ *$_) RequireRemoteTargetInstance($_) (bool, error) {
  $$$
}'

# Search for any test files that might be testing this indirectly
rg "RequireRemoteTargetInstance" --type go

Length of output: 1932


Script:

#!/bin/bash
# Search for engine controller tests that might cover this scenario
rg -B 5 -A 5 "TargetNodeID" controller/engine_controller_test.go

# Also check for any test utilities or helper functions that set up engines
ast-grep --pattern 'func $_($$$) *longhorn.Engine {
  $$$
}' --lang go

Length of output: 4907


Script:

#!/bin/bash
# Search for engine controller tests that might test different node IDs
rg -B 10 -A 10 "TestEngine.*Controller" controller/engine_controller_test.go

# Also check if there are any tests setting different NodeID and TargetNodeID
rg "TargetNodeID.*NodeID|NodeID.*TargetNodeID" --type go

Length of output: 3638

controller/monitor/node_upgrade_monitor.go (2)

174-178: 🛠️ Refactor suggestion

Improve error handling in deferred function.

The deferred function only sets the error message but doesn't update the state. This could leave the upgrade in an inconsistent state.

Consider this approach:

 defer func() {
     if err != nil {
+        m.nodeUpgradeStatus.State = longhorn.UpgradeStateError
         m.nodeUpgradeStatus.Message = err.Error()
     }
 }()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

	defer func() {
		if err != nil {
			m.nodeUpgradeStatus.State = longhorn.UpgradeStateError
			m.nodeUpgradeStatus.Message = err.Error()
		}
	}()

289-294: ⚠️ Potential issue

Potential resource leak in engineClientProxy cleanup.

The engineClientProxy.Close() is deferred inside the loop, which means all connections will be held until the function returns. This could lead to resource exhaustion if processing many volumes.

Apply this fix:

-engineClientProxy, err := engineapi.GetCompatibleClient(engine, engineCliClient, m.ds, m.logger, m.proxyConnCounter)
-if err != nil {
-    return err
-}
-defer engineClientProxy.Close()
+engineClientProxy, err := engineapi.GetCompatibleClient(engine, engineCliClient, m.ds, m.logger, m.proxyConnCounter)
+if err != nil {
+    return err
+}
+// Close immediately after use
+func() {
+    defer engineClientProxy.Close()
+    _, err = engineClientProxy.SnapshotCreate(engine, m.upgradeManagerName+"-"+util.RandomID(), snapLabels, freezeFilesystem)
+}()
+if err != nil {
+    return err
+}

Committable suggestion skipped: line range outside the PR's diff.

engineapi/instance_manager.go (2)

928-947: ⚠️ Potential issue

Fix typo in error message

There's a typo in the error message ("date" should be "data").

Apply this fix:

 	case longhorn.DataEngineTypeV1:
-		return fmt.Errorf("target deletion for date engine %v is not supported yet", longhorn.DataEngineTypeV1)
+		return fmt.Errorf("target deletion for data engine %v is not supported yet", longhorn.DataEngineTypeV1)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

type EngineInstanceDeleteTargetRequest struct {
	Engine *longhorn.Engine
}

// EngineInstanceDeleteTarget deletes target for engine instance
func (c *InstanceManagerClient) EngineInstanceDeleteTarget(req *EngineInstanceDeleteTargetRequest) error {
	if req.Engine == nil {
		return errors.New("EngineInstanceDeleteTarget: engine is nil")
	}

	engine := req.Engine
	switch engine.Spec.DataEngine {
	case longhorn.DataEngineTypeV1:
		return fmt.Errorf("target deletion for data engine %v is not supported yet", longhorn.DataEngineTypeV1)
	case longhorn.DataEngineTypeV2:
		return c.instanceServiceGrpcClient.InstanceDeleteTarget(string(engine.Spec.DataEngine), req.Engine.Name, string(longhorn.InstanceManagerTypeEngine))
	default:
		return fmt.Errorf("unknown data engine %v", engine.Spec.DataEngine)
	}
}

906-926: ⚠️ Potential issue

Add validation for TargetAddress and fix typo

Two issues need attention:

  1. The TargetAddress field should be validated before use
  2. There's a typo in the error message ("date" should be "data")

Apply these fixes:

 func (c *InstanceManagerClient) EngineInstanceSwitchOverTarget(req *EngineInstanceSwitchOverTargetRequest) error {
 	if req.Engine == nil {
 		return errors.New("EngineInstanceSwitchOverTarget: engine is nil")
 	}
+	if req.TargetAddress == "" {
+		return errors.New("EngineInstanceSwitchOverTarget: target address is empty")
+	}
 
 	engine := req.Engine
 	switch engine.Spec.DataEngine {
 	case longhorn.DataEngineTypeV1:
-		return fmt.Errorf("target switchover for date engine %v is not supported yet", longhorn.DataEngineTypeV1)
+		return fmt.Errorf("target switchover for data engine %v is not supported yet", longhorn.DataEngineTypeV1)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

type EngineInstanceSwitchOverTargetRequest struct {
	Engine        *longhorn.Engine
	TargetAddress string
}

// EngineInstanceSwitchOverTarget switches over target for engine instance
func (c *InstanceManagerClient) EngineInstanceSwitchOverTarget(req *EngineInstanceSwitchOverTargetRequest) error {
	if req.Engine == nil {
		return errors.New("EngineInstanceSwitchOverTarget: engine is nil")
	}
	if req.TargetAddress == "" {
		return errors.New("EngineInstanceSwitchOverTarget: target address is empty")
	}

	engine := req.Engine
	switch engine.Spec.DataEngine {
	case longhorn.DataEngineTypeV1:
		return fmt.Errorf("target switchover for data engine %v is not supported yet", longhorn.DataEngineTypeV1)
	case longhorn.DataEngineTypeV2:
		return c.instanceServiceGrpcClient.InstanceSwitchOverTarget(string(engine.Spec.DataEngine), req.Engine.Name, string(longhorn.InstanceManagerTypeEngine), req.TargetAddress)
	default:
		return fmt.Errorf("unknown data engine %v", engine.Spec.DataEngine)
	}
}
controller/instance_handler.go (1)

716-790: 🛠️ Refactor suggestion

Refactor createInstance for better readability and maintainability.

The function has multiple levels of nesting and complex conditions. Consider:

  1. Extracting the v2 data engine specific logic into a separate function.
  2. Using early returns to reduce nesting.
  3. Adding more detailed comments explaining the different scenarios.

Example structure:

func (h *InstanceHandler) createInstance(...) error {
    if h.isEngineOfV2DataEngine(obj, dataEngine) {
        return h.createV2EngineInstance(...)
    }
    return h.createV1Instance(...)
}
controller/volume_controller.go (3)

1923-1930: ⚠️ Potential issue

Handle potential empty OwnerID to prevent errors

The engine state transition logic for v2 needs additional error handling.

Add error handling for empty OwnerID:

  if types.IsDataEngineV1(v.Spec.DataEngine) {
    e.Spec.DesireState = longhorn.InstanceStateRunning
  } else {
+   if v.Status.OwnerID == "" {
+     return fmt.Errorf("cannot transition engine state: volume owner ID is empty")
+   }
    if v.Spec.Image == v.Status.CurrentImage && v.Spec.TargetNodeID == v.Status.CurrentTargetNodeID {
      e.Spec.DesireState = longhorn.InstanceStateRunning
    }
  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

	if types.IsDataEngineV1(v.Spec.DataEngine) {
		e.Spec.DesireState = longhorn.InstanceStateRunning
	} else {
		if v.Status.OwnerID == "" {
			return fmt.Errorf("cannot transition engine state: volume owner ID is empty")
		}
		if v.Spec.Image == v.Status.CurrentImage && v.Spec.TargetNodeID == v.Status.CurrentTargetNodeID {
			e.Spec.DesireState = longhorn.InstanceStateRunning
		}
	}

3283-3293: ⚠️ Potential issue

Add error handling for instance manager retrieval

The detached replica update needs additional error handling.

Add error handling for missing instance manager:

  im, err := c.ds.GetRunningInstanceManagerByNodeRO(r.Spec.NodeID, longhorn.DataEngineTypeV2)
  if err != nil {
+   if datastore.ErrorIsNotFound(err) {
+     return fmt.Errorf("cannot update replica: no running instance manager found on node %v", r.Spec.NodeID)
+   }
    return err
  }

Committable suggestion skipped: line range outside the PR's diff.


3180-3254: ⚠️ Potential issue

Add error handling for non-running engine state during v2 data engine live upgrade

The empty else branch needs proper error handling for non-running engine states.

Add error handling for the TODO:

  } else {
-   // TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
+   log.Warnf("Engine is in unexpected state %v during v2 data engine live upgrade", e.Status.CurrentState)
+   return fmt.Errorf("cannot proceed with live upgrade: engine is in %v state", e.Status.CurrentState)
  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

		if isV2DataEngineLiveUpgradeCompleted(v) {
			return nil
		}

		log = log.WithField("engine", e.Name)

		if v.Spec.TargetNodeID != "" {
			if e.Spec.TargetNodeID != v.Spec.TargetNodeID {
				if e.Spec.Image != v.Spec.Image {
					log.Infof("Updating image from %s to %s for v2 data engine live upgrade", e.Spec.Image, v.Spec.Image)
					e.Spec.Image = v.Spec.Image
				}

				log.Infof("Updating target node from %s to %s for v2 data engine live upgrade", e.Spec.TargetNodeID, v.Spec.TargetNodeID)
				e.Spec.TargetNodeID = v.Spec.TargetNodeID
				return nil
			}

			if !e.Status.TargetInstanceReplacementCreated && e.Status.CurrentTargetNodeID != v.Spec.TargetNodeID {
				log.Debug("Waiting for target instance replacement to be created")
				return nil
			}

			if e.Status.CurrentTargetNodeID != v.Spec.TargetNodeID {
				if e.Status.CurrentState == longhorn.InstanceStateRunning {
					log.Infof("Suspending engine for v2 data engine live upgrade")
					e.Spec.DesireState = longhorn.InstanceStateSuspended
					return nil
				} else {
					log.Warnf("Engine is in unexpected state %v during v2 data engine live upgrade", e.Status.CurrentState)
					return fmt.Errorf("cannot proceed with live upgrade: engine is in %v state", e.Status.CurrentState)
				}
			}

			// At this moment:
			// 1. volume is running and healthy
			// 2. engine is suspended
			// 3. initiator is correcting to new target
			// 4. old target is still existing

			if replicaAddressMap, err := c.constructReplicaAddressMap(v, e, rs); err != nil {
				return nil
			} else {
				if !reflect.DeepEqual(e.Spec.UpgradedReplicaAddressMap, replicaAddressMap) {
					e.Spec.UpgradedReplicaAddressMap = replicaAddressMap
					return nil
				}
			}

			if e.Status.CurrentState == longhorn.InstanceStateSuspended {
				log.Infof("Resuming engine for live upgrade")
				e.Spec.DesireState = longhorn.InstanceStateRunning
				return nil
			}

			if e.Status.CurrentState != longhorn.InstanceStateRunning {
				log.Debugf("Engine is in %v, waiting for engine to be running", e.Status.CurrentState)
				return nil
			}

			// At this point:
			// 1. volume is running and healthy
			// 2. engine is running
			// 3. initiator is correcting to new target
			// 4. old target is still deleted

			if v.Status.CurrentTargetNodeID != v.Spec.TargetNodeID {
				v.Status.CurrentTargetNodeID = v.Spec.TargetNodeID
				return nil
			}
		}
	}

	c.finishLiveEngineUpgrade(v, e, rs, log)

	return nil
🧰 Tools
🪛 golangci-lint (1.61.0)

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (14)
controller/instance_handler_test.go (2)

91-94: Consider adding test cases for IsEngine method

The IsEngine method is implemented but not covered by any test cases in the test suite. Consider adding test cases to verify both positive and negative scenarios.

func TestReconcileInstanceState(c *C) {
	testCases := map[string]struct {
+		"verify_is_engine_with_engine": {
+			longhorn.InstanceTypeEngine,
+			nil,
+			newEngine("test-engine", "", "", "", "", 0, false, longhorn.InstanceStateStopped, longhorn.InstanceStateStopped),
+			newEngine("test-engine", "", "", "", "", 0, false, longhorn.InstanceStateStopped, longhorn.InstanceStateStopped),
+			false,
+		},
+		"verify_is_engine_with_non_engine": {
+			longhorn.InstanceTypeReplica,
+			nil,
+			newReplica("test-replica", "", "", "", 0, false, longhorn.InstanceStateStopped, longhorn.InstanceStateStopped),
+			newReplica("test-replica", "", "", "", 0, false, longhorn.InstanceStateStopped, longhorn.InstanceStateStopped),
+			false,
+		},

100-102: Add documentation for RequireRemoteTargetInstance

The new method RequireRemoteTargetInstance lacks documentation explaining its purpose and return values. Consider adding comments to clarify its usage.

+// RequireRemoteTargetInstance checks if the instance requires a remote target instance.
+// Returns true if a remote target is required, false otherwise.
 func (imh *MockInstanceManagerHandler) RequireRemoteTargetInstance(obj interface{}) (bool, error) {
 	return false, nil
 }
engineapi/instance_manager.go (1)

532-555: Add nil check for replica addresses map

The function looks good overall with proper address validation and filtering logic. However, consider adding a nil check for the input map to make it more robust.

 func getReplicaAddresses(replicaAddresses map[string]string, initiatorAddress, targetAddress string) (map[string]string, error) {
+    if replicaAddresses == nil {
+        return make(map[string]string), nil
+    }
+
     initiatorIP, _, err := net.SplitHostPort(initiatorAddress)
     if err != nil {
         return nil, errors.New("invalid initiator address format")
     }
controller/instance_handler.go (3)

89-132: Consider refactoring target instance management for better error handling.

The target instance management logic contains nested error conditions that could be simplified. Consider extracting the target instance management into a separate function with proper error handling.

Example structure:

func (h *InstanceHandler) syncTargetInstanceStatus(targetIM *longhorn.InstanceManager, instanceName string, status *longhorn.InstanceStatus) error {
    targetClient, err := h.getTargetInstanceManagerClient(targetIM)
    if err != nil {
        return err
    }
    defer targetClient.Close()
    
    return h.updateTargetInstanceStatus(targetClient, targetIM, instanceName, status)
}

716-772: Add documentation for complex v2 data engine instance creation logic.

The instance creation logic for v2 data engine is complex with multiple conditions and state checks. Consider adding documentation to explain:

  • The different states and transitions
  • When and why target instances are created
  • The relationship between local and remote instances
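
For instance, the requested documentation could be a doc comment along these lines (a summary of the flow described in this review, not the current source text):

// For v2 data engine engines, instance creation has to distinguish three situations:
//
//   - Normal start: spec.TargetNodeID is empty, so a single instance is created on
//     spec.NodeID through that node's instance manager.
//   - Live upgrade, target side: spec.TargetNodeID is set and differs from spec.NodeID,
//     so a standby target instance is created on the target node's instance manager
//     while the initiator keeps running on spec.NodeID.
//   - Live upgrade, switchover: once the target instance exists, the initiator is
//     suspended, switched over to the new target address, resumed, and only then is
//     the old target instance deleted.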

861-881: Improve error handling in isV2DataEngineBeingUpgraded.

The function silently returns false on error when checking node upgrade status. Consider propagating the error to allow the caller to handle it appropriately.

-func (h *InstanceHandler) isV2DataEngineBeingUpgraded(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) bool {
+func (h *InstanceHandler) isV2DataEngineBeingUpgraded(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) (bool, error) {
     if !types.IsDataEngineV2(spec.DataEngine) {
-        return false
+        return false, nil
     }

     upgradeRequested, err := h.ds.IsNodeDataEngineUpgradeRequested(spec.NodeID)
     if err != nil {
         logrus.WithError(err).Errorf("Failed to get node %v", spec.NodeID)
-        return false
+        return false, err
     }

     if !upgradeRequested {
-        return false
+        return false, nil
     }

     if spec.TargetNodeID == "" {
-        return false
+        return false, nil
     }

-    return spec.NodeID != spec.TargetNodeID && spec.TargetNodeID == status.CurrentTargetNodeID
+    return spec.NodeID != spec.TargetNodeID && spec.TargetNodeID == status.CurrentTargetNodeID, nil
}
controller/engine_controller.go (4)

437-467: Consider enhancing error handling and validation

The findInstanceManagerAndIPs method could be improved in a few areas:

  1. Add validation for empty IP addresses
  2. Consider handling the case where both initiator and target IPs are the same
  3. Add logging for important state changes
 func (ec *EngineController) findInstanceManagerAndIPs(obj interface{}) (im *longhorn.InstanceManager, initiatorIP string, targetIP string, err error) {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
         return nil, "", "", fmt.Errorf("invalid object for engine: %v", obj)
     }

     initiatorIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, false)
     if err != nil {
         return nil, "", "", err
     }

+    if initiatorIM.Status.IP == "" {
+        return nil, "", "", fmt.Errorf("initiator instance manager %v has no IP", initiatorIM.Name)
+    }
     initiatorIP = initiatorIM.Status.IP
     targetIP = initiatorIM.Status.IP
     im = initiatorIM

     if e.Spec.TargetNodeID != "" {
         targetIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, true)
         if err != nil {
             return nil, "", "", err
         }
+        if targetIM.Status.IP == "" {
+            return nil, "", "", fmt.Errorf("target instance manager %v has no IP", targetIM.Name)
+        }
         targetIP = targetIM.Status.IP

         if !e.Status.TargetInstanceReplacementCreated && e.Status.CurrentTargetNodeID == "" {
             im = targetIM
         }
     }

     return im, initiatorIP, targetIP, nil
 }

Line range hint 2419-2476: Enhance upgrade logic robustness

The v2 data engine upgrade logic has a few areas that could be improved:

  1. Consider extracting the instance manager validation into a separate helper method
  2. Add timeouts for instance manager checks
  3. Consider adding metrics/events for upgrade operations
+func (ec *EngineController) validateInstanceManager(nodeID string, dataEngine longhorn.DataEngineType) (*longhorn.InstanceManager, error) {
+    im, err := ec.ds.GetRunningInstanceManagerByNodeRO(nodeID, dataEngine)
+    if err != nil {
+        return nil, err
+    }
+    if im.Status.CurrentState != longhorn.InstanceManagerStateRunning {
+        return nil, fmt.Errorf("instance manager %v is not running", im.Name)
+    }
+    return im, nil
+}

 if types.IsDataEngineV2(e.Spec.DataEngine) {
-    // Check if the initiator instance is running
-    im, err := ec.ds.GetRunningInstanceManagerByNodeRO(e.Spec.NodeID, longhorn.DataEngineTypeV2)
+    // Validate initiator instance manager
+    im, err := ec.validateInstanceManager(e.Spec.NodeID, longhorn.DataEngineTypeV2)
     if err != nil {
-        return err
+        return errors.Wrapf(err, "failed to validate initiator instance manager")
     }
-    if im.Status.CurrentState != longhorn.InstanceManagerStateRunning {
-        return fmt.Errorf("instance manager %v for initiating instance %v is not running", im.Name, e.Name)
-    }

646-702: Improve error handling in suspend/resume operations

The suspend and resume operations could benefit from:

  1. Adding retries for transient failures
  2. Including more detailed error messages
  3. Adding metrics for operation tracking
 func (ec *EngineController) SuspendInstance(obj interface{}) error {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
         return fmt.Errorf("invalid object for engine instance suspension: %v", obj)
     }

+    log := getLoggerForEngine(ec.logger, e)
+    log.Info("Suspending engine instance")

     if !types.IsDataEngineV2(e.Spec.DataEngine) {
         return fmt.Errorf("suspending engine instance is not supported for data engine %v", e.Spec.DataEngine)
     }

     if e.Spec.VolumeName == "" || e.Spec.NodeID == "" {
         return fmt.Errorf("missing parameters for engine instance suspension: %+v", e)
     }

     im, err := ec.ds.GetInstanceManagerByInstanceRO(obj, false)
     if err != nil {
-        return err
+        return errors.Wrapf(err, "failed to get instance manager for engine %v", e.Name)
     }
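
For the retry point specifically, a thin wrapper around client-go's retry helper could be enough; isTransientGRPCError and the wrapped call are placeholders, not existing Longhorn helpers, and this assumes client-go is available:

import (
	"strings"

	"k8s.io/client-go/util/retry"
)

// isTransientGRPCError is a stand-in predicate; a real implementation would inspect
// gRPC status codes rather than error strings.
func isTransientGRPCError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "unavailable") || strings.Contains(msg, "deadline exceeded")
}

// suspendWithRetry retries the given suspend call with the default backoff for
// errors classified as transient.
func suspendWithRetry(suspend func() error) error {
	return retry.OnError(retry.DefaultBackoff, isTransientGRPCError, suspend)
}

SuspendInstance could then pass a closure over the instance-manager suspend call (whichever client method it uses) into suspendWithRetry instead of invoking it once.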

786-823: Improve DeleteTarget implementation

The DeleteTarget implementation could be enhanced:

  1. Add validation for the target node ID
  2. Consider adding retries for transient failures
  3. Add cleanup for any resources after deletion
 func (ec *EngineController) DeleteTarget(obj interface{}) error {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
         return fmt.Errorf("invalid object for engine target deletion: %v", obj)
     }

+    log := getLoggerForEngine(ec.logger, e)
+    log.Info("Starting target deletion")

     if !types.IsDataEngineV2(e.Spec.DataEngine) {
         return fmt.Errorf("deleting target for engine instance is not supported for data engine %v", e.Spec.DataEngine)
     }

-    ec.logger.WithField("engine", e.Name).Info("Deleting target instance")

     if e.Spec.VolumeName == "" || e.Spec.NodeID == "" {
         return fmt.Errorf("missing parameters for engine target deletion: %+v", e)
     }

+    // Add validation for target node ID
+    if targetNodeID == "" {
+        return fmt.Errorf("target node ID is required for target deletion")
+    }

     // ... rest of the implementation
controller/volume_controller.go (3)

1007-1012: Improve v2 engine image handling comment clarity

The comment could be clearer about why replica engine image can be different from volume image for v2 volumes.

-				// For a v2 volume, the instance manager image of a replica can be different from the one of its volume
+				// For a v2 volume, replicas use the instance manager image instead of the volume's engine image

3208-3210: Simplify control flow by removing empty else

The code can be simplified by removing the empty else block and outdenting the code.

-            } else {
-                // TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
-            }
+            }
🧰 Tools
🪛 golangci-lint (1.61.0)

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor

[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)


3219-3226: Simplify if-else control flow

The code can be simplified by moving the short variable declaration onto its own line, dropping the else, and outdenting its contents.

-            if replicaAddressMap, err := c.constructReplicaAddressMap(v, e, rs); err != nil {
-                return nil
-            } else {
-                if !reflect.DeepEqual(e.Spec.UpgradedReplicaAddressMap, replicaAddressMap) {
-                    e.Spec.UpgradedReplicaAddressMap = replicaAddressMap
-                    return nil
-                }
-            }
+            replicaAddressMap, err := c.constructReplicaAddressMap(v, e, rs)
+            if err != nil {
+                return nil
+            }
+            if !reflect.DeepEqual(e.Spec.UpgradedReplicaAddressMap, replicaAddressMap) {
+                e.Spec.UpgradedReplicaAddressMap = replicaAddressMap
+                return nil
+            }
🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)

webhook/resources/nodedataengineupgrade/validator.go (1)

56-57: Simplify error creation by removing unnecessary fmt.Errorf calls

In lines 56-57 and 61-62, fmt.Errorf is used without any formatting. Since no formatting is needed, you can simplify the code by directly passing the error message string to werror.NewInvalidError, enhancing code readability.

Apply this diff to simplify the code:

  if nodeUpgrade.Spec.InstanceManagerImage == "" {
-     err := fmt.Errorf("instanceManagerImage is required")
-     return werror.NewInvalidError(err.Error(), "spec.instanceManagerImage")
+     return werror.NewInvalidError("instanceManagerImage is required", "spec.instanceManagerImage")
  }

  if nodeUpgrade.Spec.DataEngineUpgradeManager == "" {
-     err := fmt.Errorf("dataEngineUpgradeManager is required")
-     return werror.NewInvalidError(err.Error(), "spec.dataEngineUpgradeManager")
+     return werror.NewInvalidError("dataEngineUpgradeManager is required", "spec.dataEngineUpgradeManager")
  }

Also applies to: 61-62

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cd56fa2 and 49625f4.

📒 Files selected for processing (18)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/replica_controller.go (5 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
  • engineapi/instance_manager.go (5 hunks)
  • engineapi/instance_manager_test.go (1 hunks)
  • scheduler/replica_scheduler.go (1 hunks)
  • types/types.go (4 hunks)
  • webhook/resources/dataengineupgrademanager/mutator.go (1 hunks)
  • webhook/resources/dataengineupgrademanager/validator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/mutator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/validator.go (1 hunks)
  • webhook/resources/volume/validator.go (5 hunks)
  • webhook/server/mutation.go (2 hunks)
  • webhook/server/validation.go (2 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • controller/volume_controller_test.go
  • engineapi/instance_manager_test.go
  • scheduler/replica_scheduler.go
  • types/types.go
  • webhook/resources/dataengineupgrademanager/mutator.go
  • webhook/resources/dataengineupgrademanager/validator.go
  • webhook/resources/nodedataengineupgrade/mutator.go
  • webhook/resources/volume/validator.go
  • webhook/server/mutation.go
  • webhook/server/validation.go
🧰 Additional context used
📓 Learnings (3)
controller/engine_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/engine_controller.go:524-527
Timestamp: 2024-11-25T12:39:58.926Z
Learning: In `controller/engine_controller.go`, `e.Status.Port` is sourced from the SPDK engine and does not require additional validation.
controller/instance_handler.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/instance_handler.go:919-939
Timestamp: 2024-11-25T23:56:53.252Z
Learning: In the Longhorn Manager Go code, the function `engineapi.NewInstanceManagerClient` does not accept a `context.Context` parameter. Therefore, we cannot pass `ctx` to it in functions like `isInstanceExist` in `controller/instance_handler.go`.
webhook/resources/nodedataengineupgrade/validator.go (3)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: webhook/resources/dataengineupgrademanager/validator.go:41-53
Timestamp: 2024-11-26T00:01:13.256Z
Learning: In `webhook/resources/dataengineupgrademanager/validator.go` of the Longhorn project, when validating `DataEngineUpgradeManager` creation, the `upgradeManager.Spec.Nodes` field can be empty, indicating that all available nodes should be upgraded.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: webhook/resources/nodedataengineupgrade/validator.go:46-48
Timestamp: 2024-11-26T00:07:32.856Z
Learning: In `webhook/resources/nodedataengineupgrade/validator.go`, it's unnecessary to check if `NodeDataEngineUpgrade.Spec` is nil before accessing its fields, as it is guaranteed to be non-nil.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: webhook/resources/nodedataengineupgrade/validator.go:78-93
Timestamp: 2024-11-25T23:56:04.203Z
Learning: In the `Update` method of `nodeDataEngineUpgradeValidator` in `validator.go`, `Spec` is a struct and cannot be nil, so nil checks before accessing `Spec` fields are unnecessary.
🪛 golangci-lint (1.61.0)
controller/volume_controller.go

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (17)
controller/instance_handler_test.go (3)

75-89: Consistent error message style in mock methods

The implementation of the new mock methods follows a consistent pattern of returning "not mocked" errors, which is good practice.


136-140: Verify the impact of new engine status fields

The newEngine function has been updated to include new status fields (TargetIP, StorageIP, StorageTargetIP, TargetPort). These fields should be properly initialized and used in the test cases.

✅ Verification successful

Based on the search results, I can now generate the final response since I have enough context about the usage of these fields in the codebase:

The new engine status fields are properly initialized and used in the test case

The test case correctly initializes all the new status fields:

  • TargetIP, StorageIP, and StorageTargetIP are all set to the same IP value
  • TargetPort is explicitly set to 0 with a comment explaining this is expected for v1 volumes
  • These field initializations match the actual field usage patterns seen throughout the codebase, particularly in controller/instance_handler.go and controller/volume_controller.go
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test cases using these new fields
rg -A 5 'TargetIP|StorageIP|StorageTargetIP|TargetPort' --type go

Length of output: 78913


Line range hint 39-51: Verify the updated method signatures in the interface definition

The GetInstance and CreateInstance methods have been updated to include the isInstanceOnRemoteNode parameter, but we should verify this matches the interface definition.

✅ Verification successful

Method signatures in mock implementation match the interface definition

The mock implementation in controller/instance_handler_test.go correctly implements the GetInstance and CreateInstance methods with the isInstanceOnRemoteNode parameter, matching the interface definition in controller/instance_handler.go.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the interface definition to verify method signatures
ast-grep --pattern 'type InstanceManagerHandler interface {
  $$$
  GetInstance($$$) ($$$)
  CreateInstance($$$) ($$$)
  $$$
}'

Length of output: 1169

controller/replica_controller.go (3)

Line range hint 316-355: LGTM: CreateInstance method signature update

The addition of isInstanceOnRemoteNode parameter and its usage in instance manager retrieval is consistent and well-implemented.


Line range hint 636-647: LGTM: GetInstance method signature update

The addition of isInstanceOnRemoteNode parameter and its usage in instance manager retrieval is consistent. The additional validation for v2 data engine instances is a good safety check.


631-634: LGTM: IsEngine method implementation

The IsEngine method is correctly implemented to check if the provided interface is an Engine type.

engineapi/instance_manager.go (3)

283-292: LGTM: Instance process status fields properly updated

The new fields StandbyTargetPortStart and StandbyTargetPortEnd are correctly added to the instance process status structure and properly mapped from the API instance status.


488-491: LGTM: Proper integration of replica address filtering

The integration of getReplicaAddresses is well implemented with proper error handling.


864-947: LGTM: Well-structured engine instance operations

The new engine instance operations are well implemented with:

  • Consistent error handling
  • Proper nil checks
  • Clear data engine type validation
  • Appropriate delegation to instance service client
controller/instance_handler.go (2)

38-46: LGTM: Interface changes align with v2 data engine live upgrade requirements.

The new methods added to InstanceManagerHandler interface provide the necessary functionality for managing v2 data engine live upgrades, including instance suspension/resumption and target instance management.


602-626: LGTM: Robust implementation of instance suspension with proper error handling.

The implementation properly handles:

  • Validation of v2 data engine requirement
  • Graceful suspension and target switchover
  • Cleanup on failure with proper error propagation
controller/volume_controller.go (5)

1827-1836: LGTM: Improved replica state management for v2 engine

The code correctly handles replica state management differently for v1 and v2 data engines. For v2, replica image doesn't need to match volume image before running.


Line range hint 3086-3254: LGTM: Comprehensive v2 data engine live upgrade implementation

The implementation properly handles:

  1. Target node transitions
  2. Engine state management
  3. Replica address mapping
  4. Upgrade completion verification

The code is well-structured with appropriate logging and error handling.

🧰 Tools
🪛 golangci-lint (1.61.0)

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)


3817-3822: LGTM: Proper v2 replica image handling

The code correctly sets empty image for v2 replicas since they use the instance manager image instead of the volume image.


5086-5094: LGTM: Clear upgrade check logic for both v1 and v2 engines

The code properly checks upgrade status by considering:

  • Image updates for both v1 and v2
  • Additional target node check for v2 engines
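
A minimal sketch of that check, using simplified local fields (the field names here are assumptions based on this review, not the actual volume spec/status fields):

package sketch

type upgradeCheckVolume struct {
	SpecImage, CurrentImage           string
	SpecTargetNode, CurrentTargetNode string
	DataEngineV2                      bool
}

// isVolumeUpgrading reports an upgrade when the image is being changed, and
// for v2 volumes also when a target node change is still pending.
func isVolumeUpgrading(v upgradeCheckVolume) bool {
	imageChanged := v.SpecImage != v.CurrentImage
	if !v.DataEngineV2 {
		return imageChanged
	}
	return imageChanged || v.SpecTargetNode != v.CurrentTargetNode
}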

1923-1930: ⚠️ Potential issue

Add error handling for engine state transition

The empty else block needs proper error handling for non-running engine state during v2 data engine live upgrade.

Apply this diff to handle unexpected engine states:

    } else {
-        // TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
+        log.Warnf("Engine is in unexpected state %v during v2 data engine live upgrade", e.Status.CurrentState)
+        return nil
    }

Likely invalid or redundant comment.

webhook/resources/nodedataengineupgrade/validator.go (1)

17-95: LGTM!

The validator is well-implemented, and the validation logic is comprehensive. Error handling is appropriate, and previous concerns about exposing internal object details have been addressed.

@derekbit derekbit force-pushed the v2-control-upgrade branch 3 times, most recently from 91cb77e to ce1c3db on November 26, 2024 00:27
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

🧹 Outside diff range and nitpick comments (22)
controller/upgrade_manager_controller.go (1)

57-57: Track technical debt: Remove client wrapper

The TODO comment indicates technical debt regarding client wrapper removal.

Would you like me to create a GitHub issue to track this technical debt? This will help ensure it's not forgotten and can be addressed in future updates.

controller/node_upgrade_controller.go (3)

57-57: Address or remove TODO comment

The TODO comment about removing the wrapper seems outdated as most Kubernetes clients now use clientset.

Consider either implementing the change or removing the TODO comment if it's no longer relevant.


123-124: Enhance error handling in processNextWorkItem

Consider wrapping the error with additional context before passing it to handleErr.

-   err := uc.syncNodeDataEngineUpgrade(key.(string))
-   uc.handleErr(err, key)
+   if err := uc.syncNodeDataEngineUpgrade(key.(string)); err != nil {
+       err = fmt.Errorf("error processing item %v: %w", key, err)
+       uc.handleErr(err, key)
+       return true
+   }
+   uc.handleErr(nil, key)

239-246: Optimize duplicate status updates

The status is being updated twice when the state is Completed or Error:

  1. First update in the main flow (line 236)
  2. Second update in the completion block (line 241)

This could lead to unnecessary API calls.

-   if nodeUpgrade.Status.State == longhorn.UpgradeStateCompleted ||
-       nodeUpgrade.Status.State == longhorn.UpgradeStateError {
-       uc.updateNodeDataEngineUpgradeStatus(nodeUpgrade)
+   if nodeUpgrade.Status.State == longhorn.UpgradeStateCompleted ||
+       nodeUpgrade.Status.State == longhorn.UpgradeStateError {
        if uc.nodeDataEngineUpgradeMonitor != nil {
            uc.nodeDataEngineUpgradeMonitor.Close()
            uc.nodeDataEngineUpgradeMonitor = nil
        }
    }
controller/replica_controller.go (1)

Line range hint 528-611: Consider parameterizing the remote node flag

The GetInstanceManagerByInstance call uses a hardcoded false value. Consider parameterizing this similar to CreateInstance and GetInstance methods for consistency and flexibility.

engineapi/instance_manager.go (1)

532-555: Consider adding documentation and additional validation

While the function correctly filters replica addresses, it could benefit from:

  1. Documentation explaining the purpose and logic of the filtering.
  2. Validation to ensure the filtered addresses map is not empty.

Consider adding these improvements:

+// getReplicaAddresses filters the replica addresses map by excluding replicas
+// that share the same IP as the initiator when the initiator and target are different.
+// This prevents data path loops in the replication network.
 func getReplicaAddresses(replicaAddresses map[string]string, initiatorAddress, targetAddress string) (map[string]string, error) {
     if len(replicaAddresses) == 0 {
+        return nil, errors.New("empty replica addresses map")
     }
     // ... rest of the function
     if len(addresses) == 0 {
+        return nil, errors.New("all replicas were filtered out")
     }
     return addresses, nil
 }
controller/instance_handler.go (3)

58-165: Consider improving error handling in status sync.

The nested error handling in the target instance status sync could hide important issues. Consider:

  1. Consolidating error handling
  2. Adding more context to error messages
  3. Using structured logging consistently

Example improvement:

-		if err != nil {
-			logrus.WithError(err).Errorf("Failed to get instance manager client for target instance manager %v", targetIM.Name)
-			return
-		}
+		if err != nil {
+			return errors.Wrapf(err, "failed to get instance manager client for target instance manager %v", targetIM.Name)
+		}

919-939: Add validation in isInstanceExist function.

The isInstanceExist function should validate the input parameters and handle edge cases.

Consider adding these validations:

 func (h *InstanceHandler) isInstanceExist(instanceName string, spec *longhorn.InstanceSpec) (bool, error) {
+	if instanceName == "" {
+		return false, fmt.Errorf("instance name is required")
+	}
+	if spec == nil {
+		return false, fmt.Errorf("instance spec is required")
+	}
+	if spec.NodeID == "" {
+		return false, fmt.Errorf("node ID is required")
+	}
+
 	var err error

841-859: Enhance error handling in target operations.

The target operations (switchOverTarget and deleteTarget) could benefit from more detailed error handling and logging.

Consider adding retries and more detailed error context:

 func (h *InstanceHandler) switchOverTarget(instanceName string, obj runtime.Object) error {
-	logrus.Infof("Switching over target for instance %v", instanceName)
+	log := logrus.WithField("instance", instanceName)
+	log.Info("Switching over target")

 	if err := h.instanceManagerHandler.SwitchOverTarget(obj); err != nil {
+		if types.ErrorIsNotFound(err) {
+			log.Warn("Target not found during switchover")
+			return nil
+		}
 		return errors.Wrapf(err, "failed to switch over target for instance %s", instanceName)
 	}

 	return nil
 }
types/types.go (1)

1271-1273: Consider adding input validation.

The function could benefit from validation to ensure neither prefix nor nodeID is empty.

 func GenerateNodeDataEngineUpgradeName(prefix, nodeID string) string {
+	if prefix == "" || nodeID == "" {
+		logrus.Warnf("Empty prefix or nodeID provided for node data engine upgrade name generation")
+	}
 	return prefix + "-" + nodeID + "-" + util.RandomID()
 }
controller/volume_controller_test.go (3)

506-510: Fix indentation for better readability

The indentation of these lines is inconsistent with the surrounding code block.

Apply this diff to fix the indentation:

-		e.Status.TargetIP = ""
-		e.Status.StorageIP = ""
-		e.Status.StorageTargetIP = ""
-		e.Status.Port = 0
-		e.Status.TargetPort = 0
+    e.Status.TargetIP = ""
+    e.Status.StorageIP = ""
+    e.Status.StorageTargetIP = ""
+    e.Status.Port = 0
+    e.Status.TargetPort = 0

Line range hint 1-100: Consider using constants for test values

The test setup uses hardcoded values like "2" for replica count and other test parameters. Consider extracting these into named constants at the package level for better maintainability and reusability.

Example:

+const (
+    TestDefaultReplicaCount = 2
+    TestDefaultFrontend     = longhorn.VolumeFrontendBlockDev
+)
+
 func generateVolumeTestCaseTemplate() *VolumeTestCase {
-    volume := newVolume(TestVolumeName, 2)
+    volume := newVolume(TestVolumeName, TestDefaultReplicaCount)

Line range hint 101-1000: Enhance test case documentation

While the test cases are comprehensive, consider adding detailed documentation for each test case to explain:

  • The specific scenario being tested
  • The expected behavior
  • Any important preconditions or assumptions

This will make the tests more maintainable and easier to understand for other developers.

Example:

 testCases["volume create"] = tc
+
+// Test case documentation:
+// Scenario: Basic volume creation with default settings
+// Expected: Volume should be created in Creating state with correct image
+// Preconditions: None
controller/node_controller.go (1)

2181-2189: Consider refactoring for better maintainability.

Suggestions to improve the code:

  1. Extract condition reasons into constants to avoid string literals
  2. Consider extracting the scheduling decision logic into a separate method for better readability and testability
+const (
+    NodeConditionReasonDataEngineUpgrade = "DataEngineUpgradeRequested"
+)

-	} else if node.Spec.DataEngineUpgradeRequested {
-		disableScheduling = true
-		reason = string(longhorn.NodeConditionReasonNodeDataEngineUpgradeRequested)
-		message = fmt.Sprintf("Data engine of node %v is being upgraded", node.Name)
+	} else if shouldDisableSchedulingForUpgrade(node) {
+		disableScheduling, reason, message = getUpgradeSchedulingState(node)
+	}

+func shouldDisableSchedulingForUpgrade(node *longhorn.Node) bool {
+    return node.Spec.DataEngineUpgradeRequested
+}
+
+func getUpgradeSchedulingState(node *longhorn.Node) (bool, string, string) {
+    return true,
+        string(longhorn.NodeConditionReasonNodeDataEngineUpgradeRequested),
+        fmt.Sprintf("Data engine of node %v is being upgraded", node.Name)
+}
controller/engine_controller.go (3)

437-467: LGTM with a minor suggestion for error messages

The implementation correctly handles both local and remote target scenarios with proper error handling. Consider making error messages more specific by including the object type in validation errors.

-        return nil, "", "", fmt.Errorf("invalid object for engine: %v", obj)
+        return nil, "", "", fmt.Errorf("expected *longhorn.Engine but got: %T", obj)

Line range hint 2419-2476: Well-structured v2 data engine upgrade implementation

The implementation thoroughly validates both initiator and target instances before proceeding with the upgrade. The status cleanup after upgrade is comprehensive.

Consider implementing a timeout mechanism for the upgrade process to prevent potential deadlocks in case either the initiator or target instance becomes unresponsive during the upgrade.
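
One possible shape for such a timeout guard, sketched with context.WithTimeout (the upgrade callback and the five-minute limit are placeholders, not values from this PR):

package sketch

import (
	"context"
	"fmt"
	"time"
)

// upgradeWithTimeout runs the upgrade step in a goroutine and aborts with an
// error if it does not finish within the deadline.
func upgradeWithTimeout(parent context.Context, upgrade func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, 5*time.Minute)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- upgrade(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("v2 data engine upgrade timed out: %w", ctx.Err())
	}
}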


1041-1078: Consider simplifying error handling logic

The error handling for v2 data engine upgrades is comprehensive but could be simplified for better maintainability.

Consider extracting the upgrade check logic into a separate method:

+func (m *EngineMonitor) handleV2EngineUpgradeError(engine *longhorn.Engine, err error) error {
+    if !types.IsDataEngineV2(engine.Spec.DataEngine) || !apierrors.IsNotFound(errors.Cause(err)) {
+        return err
+    }
+
+    upgrading, err := m.ds.IsNodeDataEngineUpgradeRequested(engine.Spec.NodeID)
+    if err != nil {
+        return errors.Wrapf(err, "failed to check if engine %v is being upgraded", engine.Name)
+    }
+    if !upgrading {
+        return err
+    }
+
+    updated, err := m.isInstanceManagerUpdated(engine)
+    if err != nil {
+        return errors.Wrapf(err, "failed to check if the instance manager is updated")
+    }
+    if !updated {
+        m.logger.Warnf("v2 data engine %v is being upgraded, will retry updating status later", engine.Name)
+        return nil
+    }
+
+    return m.validateReplicaStates(engine)
+}
controller/volume_controller.go (1)

1007-1012: Improve v2 data engine replica image validation

The code correctly handles the case where replica engine image can be different from volume image for v2 data engine. However, the warning message could be more descriptive.

-    log.WithField("replica", r.Name).Warnf("Replica engine image %v is different from volume engine image %v, "+
-        "but replica spec.Active has been set", r.Spec.Image, v.Spec.Image)
+    log.WithField("replica", r.Name).Warnf("For v1 volume: replica engine image %v is different from volume engine image %v, "+
+        "but replica spec.Active has been set. This indicates a potential issue.", r.Spec.Image, v.Spec.Image)
controller/monitor/node_upgrade_monitor.go (4)

700-704: Simplify error handling and avoid unnecessary error wrapping

Assigning the wrapped error to the named return value and then issuing a bare return makes the error flow harder to follow; returning the wrapped error directly is simpler.

Apply this diff to simplify error handling:

volume, err := m.ds.GetVolume(volumeName)
if err != nil {
-   err = errors.Wrapf(err, "failed to get volume %v for switch over", volumeName)
-   return
+   return errors.Wrapf(err, "failed to get volume %v for switch over", volumeName)
}

755-759: Simplify error handling and avoid unnecessary error wrapping

Similar to the previous comment, you can simplify error handling here.

Apply this diff:

volume, err := m.ds.GetVolume(volumeName)
if err != nil {
-   err = errors.Wrapf(err, "failed to get volume %v for switch over", volumeName)
-   return
+   return errors.Wrapf(err, "failed to get volume %v for switch over", volumeName)
}

193-206: Consider combining deferred functions and handling errors more cleanly

Having multiple deferred functions can make the code less readable. Consider combining them and simplifying error handling.

- defer func() {
-     if err != nil {
-         node, errGet := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
-         if errGet != nil {
-             log.WithError(errGet).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
-             return
-         }
-         node.Spec.DataEngineUpgradeRequested = false
-         if _, errUpdate := m.ds.UpdateNode(node); errUpdate != nil {
-             log.WithError(errUpdate).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
-             return
-         }
-     }
- }()
+ if err != nil {
+     if node, errGet := m.ds.GetNode(nodeUpgrade.Status.OwnerID); errGet != nil {
+         log.WithError(errGet).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
+     } else {
+         node.Spec.DataEngineUpgradeRequested = false
+         if _, errUpdate := m.ds.UpdateNode(node); errUpdate != nil {
+             log.WithError(errUpdate).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
+         }
+     }
+ }

213-215: Clarify log message and ensure correct synchronization

The log message states that DataEngineUpgradeRequested is set to true, but it's still schedulable. This could cause confusion. Additionally, ensure that returning here without setting err is intentional and won't cause issues in the upgrade flow.

Consider adjusting the log message for clarity:

log.Infof("Node %v has DataEngineUpgradeRequested set to true but is still schedulable", nodeUpgrade.Status.OwnerID)

And verify that the return statement here maintains the correct upgrade flow.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 49625f4 and ce1c3db.

📒 Files selected for processing (25)
  • controller/controller_manager.go (2 hunks)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
  • controller/monitor/upgrade_manager_monitor.go (1 hunks)
  • controller/node_controller.go (2 hunks)
  • controller/node_upgrade_controller.go (1 hunks)
  • controller/replica_controller.go (5 hunks)
  • controller/uninstall_controller.go (4 hunks)
  • controller/upgrade_manager_controller.go (1 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
  • engineapi/instance_manager.go (5 hunks)
  • engineapi/instance_manager_test.go (1 hunks)
  • scheduler/replica_scheduler.go (1 hunks)
  • types/types.go (4 hunks)
  • webhook/resources/dataengineupgrademanager/mutator.go (1 hunks)
  • webhook/resources/dataengineupgrademanager/validator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/mutator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/validator.go (1 hunks)
  • webhook/resources/volume/validator.go (5 hunks)
  • webhook/server/mutation.go (2 hunks)
  • webhook/server/validation.go (2 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
🚧 Files skipped from review as they are similar to previous changes (11)
  • controller/controller_manager.go
  • controller/monitor/upgrade_manager_monitor.go
  • engineapi/instance_manager_test.go
  • scheduler/replica_scheduler.go
  • webhook/resources/dataengineupgrademanager/mutator.go
  • webhook/resources/dataengineupgrademanager/validator.go
  • webhook/resources/nodedataengineupgrade/mutator.go
  • webhook/resources/nodedataengineupgrade/validator.go
  • webhook/resources/volume/validator.go
  • webhook/server/mutation.go
  • webhook/server/validation.go
🧰 Additional context used
📓 Learnings (5)
controller/engine_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/engine_controller.go:524-527
Timestamp: 2024-11-25T12:39:58.926Z
Learning: In `controller/engine_controller.go`, `e.Status.Port` is sourced from the SPDK engine and does not require additional validation.
controller/instance_handler.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/instance_handler.go:919-939
Timestamp: 2024-11-25T23:56:53.252Z
Learning: In the Longhorn Manager Go code, the function `engineapi.NewInstanceManagerClient` does not accept a `context.Context` parameter. Therefore, we cannot pass `ctx` to it in functions like `isInstanceExist` in `controller/instance_handler.go`.
controller/monitor/node_upgrade_monitor.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/monitor/node_upgrade_monitor.go:351-357
Timestamp: 2024-11-25T23:55:02.080Z
Learning: `GetVolumeRO` guarantees that `volume` is non-nil when `err == nil`, so explicit nil checks after error handling are not needed.
controller/node_upgrade_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/node_upgrade_controller.go:135-139
Timestamp: 2024-11-26T00:03:56.311Z
Learning: In the Longhorn project, the constant `maxRetries` is defined in `base_controller.go` and should not be redefined in individual controller files.
controller/upgrade_manager_controller.go (2)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/upgrade_manager_controller.go:135-139
Timestamp: 2024-11-25T15:30:18.858Z
Learning: In the Longhorn Manager codebase, the variable `maxRetries` is defined in `base_controller.go`, and other controllers can use it without redefining it.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/node_upgrade_controller.go:135-139
Timestamp: 2024-11-26T00:03:56.311Z
Learning: In the Longhorn project, the constant `maxRetries` is defined in `base_controller.go` and should not be redefined in individual controller files.
🪛 golangci-lint (1.61.0)
controller/volume_controller.go

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (36)
controller/upgrade_manager_controller.go (4)

96-126: LGTM! Well-structured controller implementation

The queue management and worker implementation follow Kubernetes controller best practices with proper shutdown handling and worker concurrency.
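
For readers unfamiliar with the pattern, the usual client-go shape looks roughly like this (a generic sketch, not the controller's actual code; the controller name string is illustrative):

package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

type controllerSketch struct {
	queue      workqueue.RateLimitingInterface
	cacheSyncs []cache.InformerSynced
}

// Run waits for caches to sync, starts the workers, and shuts the queue down
// on exit so the workers can drain and stop.
func (c *controllerSketch) Run(workers int, stopCh <-chan struct{}) {
	defer c.queue.ShutDown()
	if !cache.WaitForNamedCacheSync("data-engine-upgrade-manager", stopCh, c.cacheSyncs...) {
		return
	}
	for i := 0; i < workers; i++ {
		go wait.Until(c.worker, time.Second, stopCh)
	}
	<-stopCh
}

func (c *controllerSketch) worker() {
	for c.processNextWorkItem() {
	}
}

func (c *controllerSketch) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)
	// The real controller would sync the resource identified by key here.
	_ = key
	return true
}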


146-156: LGTM! Well-structured helper functions

The helper functions are focused, well-named, and follow good logging practices.


202-209: ⚠️ Potential issue

Verify cleanup before removing finalizer

The finalizer is removed without verifying that all resources are properly cleaned up.

Add cleanup verification:

 if !upgradeManager.DeletionTimestamp.IsZero() {
+    if err := uc.verifyCleanup(upgradeManager); err != nil {
+        return err
+    }
     if uc.dataEngineUpgradeManagerMonitor != nil {
         uc.dataEngineUpgradeManagerMonitor.Close()
         uc.dataEngineUpgradeManagerMonitor = nil
     }
     return uc.ds.RemoveFinalizerForDataEngineUpgradeManager(upgradeManager)
 }

Likely invalid or redundant comment.


262-275: ⚠️ Potential issue

Add thread safety to monitor management

The monitor creation and access is not thread-safe, which could lead to race conditions in concurrent scenarios.

Use the previously suggested mutex:

 func (uc *DataEngineUpgradeManagerController) createDataEngineUpgradeManagerMonitor(upgradeManager *longhorn.DataEngineUpgradeManager) (monitor.Monitor, error) {
+    uc.monitorLock.Lock()
+    defer uc.monitorLock.Unlock()
     if uc.dataEngineUpgradeManagerMonitor != nil {
         return uc.dataEngineUpgradeManagerMonitor, nil
     }
     // ... rest of the function
 }

Likely invalid or redundant comment.

controller/node_upgrade_controller.go (1)

146-158: LGTM: Well-structured helper functions

The helper functions are well-implemented with proper logging and responsibility checks.

controller/instance_handler_test.go (3)

75-94: LGTM! Well-structured mock implementations.

The new mock methods are properly implemented with clear error messages indicating they are not mocked. The IsEngine implementation is correct and follows type assertion best practices.


Line range hint 39-51: Verify the impact of isInstanceOnRemoteNode parameter addition.

The addition of isInstanceOnRemoteNode parameter to GetInstance and CreateInstance methods should be tested with both true and false values to ensure proper handling of remote node instances.

Add test cases to verify remote node instance handling:

+"engine creates instance on remote node": {
+    longhorn.InstanceTypeEngine,
+    newInstanceManager(...),
+    newEngine(NonExistingInstance, "", "", TestNode2, "", 0, false, longhorn.InstanceStateStopped, longhorn.InstanceStateRunning),
+    newEngine(NonExistingInstance, "", "", TestNode2, "", 0, false, longhorn.InstanceStateStopped, longhorn.InstanceStateRunning),
+    false,
+},

136-140: Verify the initialization of new engine status fields.

The newEngine function has been updated with new fields (TargetIP, StorageIP, StorageTargetIP, TargetPort), but the test cases might not cover all scenarios where these fields are significant.

Consider adding specific test cases for these new fields:

+"engine updates target networking fields": {
+    longhorn.InstanceTypeEngine,
+    newInstanceManager(...),
+    newEngine(ExistingInstance, TestEngineImage, TestInstanceManagerName, TestNode1, TestIP1, TestPort1, true, longhorn.InstanceStateRunning, longhorn.InstanceStateRunning),
+    func(e *longhorn.Engine) *longhorn.Engine {
+        e.Status.TargetIP = TestIP2
+        e.Status.StorageIP = TestIP2
+        return e
+    }(newEngine(...)),
+    false,
+},
✅ Verification successful

Let me generate another verification script to check the test coverage specifically for these new engine status fields.


Based on the analysis of the code, I can now provide a final response:

The initialization of new engine status fields is properly covered in the test cases.

The test coverage for these fields is adequate because:

  • The fields are initialized in newEngine helper function in controller/instance_handler_test.go
  • Multiple test cases verify different states of these fields:
    • Running state tests verify proper IP and port assignments
    • Stopping state tests verify clearing of these fields
    • Error state handling tests verify field reset behavior
    • State transition tests verify proper field updates

The test case in the review comment is already covered by existing test scenarios, and additional test cases are not necessary since the fields are properly initialized and verified across various engine states and transitions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check test coverage for new engine status fields
# Expected: Find test cases that verify these fields

# Search for test cases using these fields
rg -A 5 "TargetIP|StorageIP|StorageTargetIP|TargetPort"

# Search for field usage in production code
ast-grep --pattern 'Status: longhorn.EngineStatus {
  $$$
  TargetIP: $_
  $$$
}'

Length of output: 81017


Script:

#!/bin/bash
# Search for test cases that verify these new engine status fields
rg -A 10 "newEngine.*\(" controller/

Length of output: 37658

controller/replica_controller.go (4)

Line range hint 355-528: LGTM!

The update to GetInstanceManagerByInstanceRO call is consistent with the method signature change.


611-635: Empty interface implementations need to be completed

The empty implementations suggest incomplete feature implementation.


Line range hint 636-647: Verify the impact of remote node instance retrieval

The addition of isInstanceOnRemoteNode parameter suggests this is part of remote instance management functionality. Ensure that all callers of this method are updated with the correct boolean value.

✅ Verification successful

Based on the search results, I can generate the final response as we have sufficient information to verify the impact of the changes.

All callers of GetInstance method have been properly updated with the isInstanceOnRemoteNode parameter

The search results show that all invocations of the GetInstance method across the codebase include the new boolean parameter. The usage is consistent and logical:

  • false is passed for local node instances
  • true is passed for remote node instances, particularly in switchover scenarios
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for GetInstance method calls to verify they've been updated
rg -A 2 "GetInstance\(" --type go

Length of output: 2814


Line range hint 316-355: Verify the impact of remote node instance creation

The addition of isInstanceOnRemoteNode parameter suggests this is part of remote instance management functionality. Ensure that all callers of this method are updated with the correct boolean value.

✅ Verification successful

All callers of CreateInstance method have been updated with the isInstanceOnRemoteNode parameter

The search results show that all implementations and calls of the CreateInstance method properly include and handle the isInstanceOnRemoteNode parameter:

  • Interface definition in controller/instance_handler.go includes the parameter
  • All implementations (ReplicaController, EngineController, MockInstanceManagerHandler) include the parameter
  • All calls to CreateInstance provide the boolean parameter with appropriate values:
    • false for local node instances
    • true for remote node instances
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for CreateInstance method calls to verify they've been updated
rg -A 2 "CreateInstance\(" --type go

Length of output: 3206

engineapi/instance_manager.go (4)

283-292: LGTM: New port fields added correctly

The addition of StandbyTargetPortStart and StandbyTargetPortEnd fields follows the existing naming convention and is properly mapped from the API response.


864-904: LGTM: Engine instance suspend/resume implementation

The implementation of EngineInstanceSuspend and EngineInstanceResume follows good practices:

  • Proper nil checks
  • Clear error messages
  • Consistent handling of different data engine types

906-930: LGTM: Engine instance target switchover implementation

The implementation includes proper validation and error handling:

  • Checks for nil Engine
  • Validates TargetAddress is not empty
  • Consistent error messaging

932-951: LGTM: Engine instance target deletion implementation

The implementation follows the established pattern:

  • Proper nil checks
  • Consistent error handling
  • Clear delegation to instance service client
controller/instance_handler.go (1)

38-47: LGTM: Interface changes align with v2 data engine live upgrade requirements.

The new methods and signature changes in InstanceManagerHandler interface provide a comprehensive set of operations needed for v2 data engine live upgrade functionality.

types/types.go (3)

43-44: LGTM: New constants follow naming conventions.

The new constants for data engine upgrade manager and node data engine upgrade follow the existing naming pattern and are appropriately placed within the constants block.


163-164: LGTM: New label constants are well-defined.

The new label constants for data engine upgrade follow the existing naming pattern and are appropriately placed within the labels block. The snapshot label for live upgrade is also correctly defined.

Also applies to: 193-193


1271-1291: LGTM: Well-structured utility functions for data engine upgrade.

The new utility functions follow existing patterns and provide necessary functionality for:

  • Generating unique names for node data engine upgrades
  • Creating consistent labels for upgrade manager and node upgrade resources
controller/uninstall_controller.go (3)

49-50: LGTM! Constants follow the established naming pattern.

The new CRD constants for data engine upgrade components are well-defined and consistent with existing constant naming conventions.


208-219: LGTM! Event handler registration follows existing patterns.

The event handler registration for new data engine upgrade components is properly implemented with:

  • Correct error handling for AddEventHandler calls
  • Proper cache sync registration
  • Consistent implementation pattern with other CRD handlers

653-666: LGTM! Proper integration of new deletion methods.

The integration of data engine upgrade deletion methods into the deleteCRs function is well-implemented with:

  • Proper error handling and logging
  • Consistent implementation pattern with other resource types
  • Clear log messages for tracking deletion progress
controller/volume_controller_test.go (1)

Line range hint 1-1000: Overall implementation looks good

The test implementation is thorough and well-structured, covering various volume lifecycle scenarios. The suggested improvements are minor and focused on maintainability.

controller/node_controller.go (2)

Line range hint 2177-2209: LGTM! The schedulable condition update logic looks correct.

The changes add proper handling of the data engine upgrade state by:

  1. Setting node as unschedulable when data engine upgrade is requested
  2. Using appropriate condition reason and message
  3. Maintaining existing cordon-based scheduling logic

The implementation aligns well with the PR objective of supporting v2 data engine live upgrades.


2181-2189: Verify the upgrade safety and scheduling behavior.

Please confirm:

  1. The scheduling restriction during data engine upgrade won't disrupt existing workloads
  2. The upgrade process handles concurrent scheduling requests properly
  3. The node remains functional for existing workloads while unschedulable
✅ Verification successful

Based on the search results, I can now generate a final response as I have sufficient information to verify the upgrade safety and scheduling behavior:

Scheduling restrictions during data engine upgrade are properly implemented and safe

The implementation ensures safety through multiple layers of protection:

  • The node becomes unschedulable during data engine upgrade to prevent new workload scheduling
  • Existing workloads remain functional as verified by:
    • Volume controller skips only replica replenishment during upgrade (controller/volume_controller.go)
    • Webhook validations prevent unsafe operations like volume attachment changes during upgrade
    • Clear error messages guide users about restricted operations

The code handles concurrent scheduling requests safely through:

  • Validation webhooks that reject invalid operations during upgrade
  • Proper state checks before allowing volume operations
  • Clear separation between upgrade state and volume operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential issues with scheduling during upgrades

# Check if there are any scheduling-related error logs during upgrades
rg -A 5 "DataEngineUpgradeRequested.*scheduling|scheduling.*DataEngineUpgradeRequested" 

# Look for any upgrade-related issues or race conditions
rg -A 5 "failed.*upgrade|error.*upgrade|upgrade.*failed"

Length of output: 43482

controller/engine_controller.go (1)

Line range hint 2545-2613: Well-implemented responsibility assignment logic

The method effectively handles ownership assignment considering node states, data engine availability, and delinquent nodes. The logic for transferring ownership is clear and robust.

controller/volume_controller.go (7)

1827-1836: LGTM: Improved replica state management for v2 data engine

The code correctly handles replica state transitions for both v1 and v2 data engines, with proper image validation.


1923-1930: LGTM: Enhanced engine state management for v2 data engine

The code properly handles engine state transitions based on image and target node alignment.


2236-2248: LGTM: Skip replica replenishment during node upgrade

The code correctly prevents replica replenishment when the node is being upgraded, which is essential for data engine v2 upgrades.


3301-3320: LGTM: Comprehensive volume eligibility check for live upgrades

The code properly validates volume state and robustness for both v1 and v2 data engines before allowing live upgrades.


3322-3327: LGTM: Clear v2 data engine upgrade completion check

The function clearly defines the conditions for considering a v2 data engine upgrade as complete.


3817-3822: LGTM: Proper v2 replica image handling

The code correctly handles the fact that v2 replica images can be empty and different from the volume image.


5086-5094: LGTM: Enhanced volume upgrade check

The code properly checks upgrade status for both v1 and v2 data engines, considering both image and target node changes.

controller/monitor/node_upgrade_monitor.go (2)

351-357: No need to check for nil volume after GetVolumeRO

As per the learnings, GetVolumeRO guarantees that volume is non-nil when err == nil. The existing code correctly handles this, so no action is needed.


330-335: 🛠️ Refactor suggestion

Avoid variable shadowing in deferred function

Similar to previous occurrences, err is used within a deferred function, which can cause confusion due to variable shadowing.

Apply this diff to use a different variable name:

defer func() {
-    if err != nil {
-        m.nodeUpgradeStatus.Message = err.Error()
+    if deferErr := err; deferErr != nil {
+        m.nodeUpgradeStatus.Message = deferErr.Error()
    }
}()

Or if err is assigned within the deferred function:

defer func() {
    if err != nil {
        // ...
-       if _, err := m.ds.UpdateNode(node); err != nil {
-           log.WithError(err).Warnf("Failed to update node %v", nodeUpgrade.Status.OwnerID)
+       if _, updateErr := m.ds.UpdateNode(node); updateErr != nil {
+           log.WithError(updateErr).Warnf("Failed to update node %v", nodeUpgrade.Status.OwnerID)
            return
        }
    }
}()
⛔ Skipped due to learnings
Learnt from: james-munson
PR: longhorn/longhorn-manager#3211
File: app/post_upgrade.go:102-113
Timestamp: 2024-11-10T16:45:04.898Z
Learning: In Go, when a deferred function references a variable like `err`, ensure that the variable is declared in the outer scope and not within an inner scope (such as within `if err := ...`), to prevent compilation errors and unintended variable shadowing.

Comment on lines +29 to +45
type DataEngineUpgradeManagerController struct {
*baseController

// which namespace controller is running with
namespace string
// use as the OwnerID of the controller
controllerID string

kubeClient clientset.Interface
eventRecorder record.EventRecorder

dataEngineUpgradeManagerMonitor monitor.Monitor

ds *datastore.DataStore

cacheSyncs []cache.InformerSynced
}

⚠️ Potential issue

Add mutex protection for concurrent access to shared resources

The dataEngineUpgradeManagerMonitor field is accessed from multiple goroutines (worker routines) without synchronization, which could lead to race conditions.

Add mutex protection:

 type DataEngineUpgradeManagerController struct {
     *baseController
+    monitorLock sync.RWMutex
     // ... other fields ...
     dataEngineUpgradeManagerMonitor monitor.Monitor
     // ... remaining fields ...
 }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +100 to +102
func (imh *MockInstanceManagerHandler) RequireRemoteTargetInstance(obj interface{}) (bool, error) {
return false, nil
}

🛠️ Refactor suggestion

Add test coverage for RequireRemoteTargetInstance.

The RequireRemoteTargetInstance mock implementation always returns false, which might not cover all test scenarios.

Consider parameterizing the return value:

+type MockInstanceManagerHandler struct {
+    requireRemoteTarget bool
+}

-func (imh *MockInstanceManagerHandler) RequireRemoteTargetInstance(obj interface{}) (bool, error) {
-    return false, nil
-}
+func (imh *MockInstanceManagerHandler) RequireRemoteTargetInstance(obj interface{}) (bool, error) {
+    return imh.requireRemoteTarget, nil
+}

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +224 to +227
// TODO: should we return here or mark the instance as error?
logrus.WithError(err).Errorf("Failed to check if node %v is being upgrade requested", spec.NodeID)
return
}

⚠️ Potential issue

Resolve error handling uncertainty.

The TODO comment indicates uncertainty about error handling when checking node upgrade status. This should be resolved to ensure consistent behavior.

Apply this diff:

-            // TODO: should we return here or mark the instance as error?
-            logrus.WithError(err).Errorf("Failed to check if node %v is being upgrade requested", spec.NodeID)
-            return
+            logrus.WithError(err).Errorf("Failed to check if node %v is being upgrade requested", spec.NodeID)
+            status.CurrentState = longhorn.InstanceStateError
+            return fmt.Errorf("failed to check node upgrade status: %v", err)

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +704 to +763
func (ec *EngineController) SwitchOverTarget(obj interface{}) error {
e, ok := obj.(*longhorn.Engine)
if !ok {
return fmt.Errorf("invalid object for target switchover: %v", obj)
}

if !types.IsDataEngineV2(e.Spec.DataEngine) {
return fmt.Errorf("target switchover is not supported for data engine %v", e.Spec.DataEngine)
}

if e.Spec.VolumeName == "" || e.Spec.NodeID == "" {
return fmt.Errorf("missing parameters for target switchover: %+v", e)
}

initiatorInstance, err := ec.GetInstance(obj, false)
if err != nil {
return errors.Wrapf(err, "failed to get initiator instance %v for switchover", e.Name)
}

log := getLoggerForEngine(ec.logger, e)

log.Infof("Preparing to switch over target to node %v. Initiator instance port details: port=%v, targetPort=%v, standbyTargetPort=%v",
e.Spec.TargetNodeID, initiatorInstance.Status.PortStart, initiatorInstance.Status.TargetPortStart, initiatorInstance.Status.StandbyTargetPortStart)

targetInstance, err := ec.GetInstance(obj, true)
if err != nil {
return errors.Wrapf(err, "failed to get target instance %v for switchover", e.Name)
}
log.Infof("Preparing to switch over target to node %v. Target instance port details: port=%v, targetPort=%v, standbyTargetPort=%v",
e.Spec.TargetNodeID, targetInstance.Status.PortStart, targetInstance.Status.TargetPortStart, targetInstance.Status.StandbyTargetPortStart)

targetIM, err := ec.getTargetInstanceManagerForSwitchOver(e)
if err != nil {
return err
}

initiatorIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, false)
if err != nil {
return err
}
c, err := engineapi.NewInstanceManagerClient(initiatorIM, false)
if err != nil {
return err
}
defer c.Close()

port := targetInstance.Status.TargetPortStart
if targetInstance.Status.StandbyTargetPortStart != 0 {
port = targetInstance.Status.StandbyTargetPortStart
}

log.Infof("Switching over target to %v:%v", targetIM.Status.IP, port)
return c.EngineInstanceSwitchOverTarget(&engineapi.EngineInstanceSwitchOverTargetRequest{
Engine: e,
TargetAddress: net.JoinHostPort(targetIM.Status.IP, fmt.Sprint(port)),
})
}

⚠️ Potential issue

Add port validation before switchover

While the implementation is solid, there's a potential issue with port validation. The code should validate that the selected port is non-zero before attempting the switchover.

Add port validation before the switchover:

 if targetInstance.Status.StandbyTargetPortStart != 0 {
     port = targetInstance.Status.StandbyTargetPortStart
 }
+if port == 0 {
+    return fmt.Errorf("invalid port 0 for target switchover on engine %v", e.Name)
+}

Comment on lines +174 to +180
defer func() {
if err != nil {
m.nodeUpgradeStatus.Message = err.Error()
}
}()

🛠️ Refactor suggestion

Avoid variable shadowing of err inside deferred functions

Within the deferred function, err is redefined or used from an outer scope, potentially causing confusion and unintended behavior. It's better to use a different variable name inside the deferred function.

Apply this diff to use separate variable names:

defer func() {
-    if err != nil {
-        m.nodeUpgradeStatus.Message = err.Error()
+    if deferErr := err; deferErr != nil {
+        m.nodeUpgradeStatus.Message = deferErr.Error()
    }
}()

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +192 to +208
defer func() {
if err != nil {
node, errGet := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
if errGet != nil {
log.WithError(errGet).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
return
}

node.Spec.DataEngineUpgradeRequested = false
if _, errUpdate := m.ds.UpdateNode(node); errUpdate != nil {
log.WithError(errUpdate).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
return
}
}
}()

⚠️ Potential issue

Avoid variable shadowing and handle errors properly in deferred function

In the deferred function, err is being reassigned when calling m.ds.GetNode(nodeUpgrade.Status.OwnerID), which can shadow the outer err variable and lead to unintended side effects.

Apply this diff to prevent variable shadowing:

defer func() {
    if err != nil {
-       node, err := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
-       if err != nil {
-           log.WithError(err).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
+       node, getNodeErr := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
+       if getNodeErr != nil {
+           log.WithError(getNodeErr).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
            return
        }
        node.Spec.DataEngineUpgradeRequested = false
-       if _, err := m.ds.UpdateNode(node); err != nil {
-           log.WithError(err).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
+       if _, updateErr := m.ds.UpdateNode(node); updateErr != nil {
+           log.WithError(updateErr).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
            return
        }
    }
}()

if err != nil {
return err
}
defer engineClientProxy.Close()

⚠️ Potential issue

Avoid deferring resource closures within loops to prevent resource leaks

Deferring engineClientProxy.Close() inside a loop causes all Close() calls to be postponed until the entire snapshotVolumes function returns. This can lead to resource exhaustion if the loop iterates many times.

Apply this diff to close the engineClientProxy after each iteration:

engineClientProxy, err := engineapi.GetCompatibleClient(engine, engineCliClient, m.ds, m.logger, m.proxyConnCounter)
if err != nil {
    return err
}
- defer engineClientProxy.Close()

// ... existing code ...

_, err = engineClientProxy.SnapshotCreate(engine, m.upgradeManagerName+"-"+util.RandomID(), snapLabels, freezeFilesystem)
if err != nil {
    return err
}
+ // Close the proxy before the next iteration
+ engineClientProxy.Close()

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Outside diff range and nitpick comments (33)
webhook/resources/nodedataengineupgrade/validator.go (1)

55-58: Simplify error handling by removing unnecessary variables

In the Create method, you can simplify the error handling by directly returning the error without assigning it to an intermediate variable err.

Apply this diff to simplify the code:

 if nodeUpgrade.Spec.InstanceManagerImage == "" {
-    err := fmt.Errorf("instanceManagerImage is required")
-    return werror.NewInvalidError(err.Error(), "spec.instanceManagerImage")
+    return werror.NewInvalidError("instanceManagerImage is required", "spec.instanceManagerImage")
 }

 if nodeUpgrade.Spec.DataEngineUpgradeManager == "" {
-    err := fmt.Errorf("dataEngineUpgradeManager is required")
-    return werror.NewInvalidError(err.Error(), "spec.dataEngineUpgradeManager")
+    return werror.NewInvalidError("dataEngineUpgradeManager is required", "spec.dataEngineUpgradeManager")
 }

Also applies to: 60-63

k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (2)

47-53: Consider adding type assertion safety check

The type assertion m.(*v1beta2.DataEngineUpgradeManager) could panic if the cache contains objects of the wrong type. Consider adding a type assertion safety check:

 func (s *dataEngineUpgradeManagerLister) List(selector labels.Selector) (ret []*v1beta2.DataEngineUpgradeManager, err error) {
 	err = cache.ListAll(s.indexer, selector, func(m interface{}) {
-		ret = append(ret, m.(*v1beta2.DataEngineUpgradeManager))
+		obj, ok := m.(*v1beta2.DataEngineUpgradeManager)
+		if !ok {
+			return
+		}
+		ret = append(ret, obj)
 	})
 	return ret, err
 }

76-82: Consider adding type assertion safety check in namespace lister

Similar to the global lister, the type assertion in the namespace-specific List method could be made safer.

 func (s dataEngineUpgradeManagerNamespaceLister) List(selector labels.Selector) (ret []*v1beta2.DataEngineUpgradeManager, err error) {
 	err = cache.ListAllByNamespace(s.indexer, s.namespace, selector, func(m interface{}) {
-		ret = append(ret, m.(*v1beta2.DataEngineUpgradeManager))
+		obj, ok := m.(*v1beta2.DataEngineUpgradeManager)
+		if !ok {
+			return
+		}
+		ret = append(ret, obj)
 	})
 	return ret, err
 }
k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (3)

95-97: LGTM: Well-documented node targeting for upgrades

The TargetNodeID field is properly documented and follows API conventions. Consider implementing validation to ensure:

  1. The target node exists and is ready
  2. The target node supports the required data engine version
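
Such a check could live in a webhook validator; a hedged sketch under assumed names (the node lookup callback, readiness flag, and data engine capability flag are illustrative placeholders):

package sketch

import "fmt"

type nodeInfo struct {
	Exists     bool
	Ready      bool
	SupportsV2 bool
}

// validateTargetNode is a sketch of the suggested validation: the target node
// must exist, be ready, and support the requested data engine version.
func validateTargetNode(targetNodeID string, lookup func(string) nodeInfo) error {
	if targetNodeID == "" {
		return nil // nothing to validate
	}
	n := lookup(targetNodeID)
	if !n.Exists {
		return fmt.Errorf("target node %v does not exist", targetNodeID)
	}
	if !n.Ready {
		return fmt.Errorf("target node %v is not ready", targetNodeID)
	}
	if !n.SupportsV2 {
		return fmt.Errorf("target node %v does not support the requested data engine version", targetNodeID)
	}
	return nil
}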

Line range hint 116-133: Document the relationship between target fields

The new target-related fields (TargetIP, StorageTargetIP, TargetPort, etc.) would benefit from additional documentation explaining:

  • The relationship between these fields
  • Their lifecycle during the upgrade process
  • Valid state transitions

155-157: Consider adding port range validation

The new standby port fields should include validation to ensure:

  1. Start port is less than end port
  2. No overlap with primary port ranges
  3. Ports are within valid range (0-65535)
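
A small sketch of that validation as a standalone helper (parameter names are assumptions; the real fields live on the instance process status):

package sketch

import "fmt"

// validateStandbyPortRange checks that the standby range is well formed, stays
// within the valid TCP port space, and does not overlap the primary range.
func validateStandbyPortRange(standbyStart, standbyEnd, primaryStart, primaryEnd int32) error {
	if standbyStart == 0 && standbyEnd == 0 {
		return nil // no standby range allocated
	}
	if standbyStart < 0 || standbyEnd > 65535 {
		return fmt.Errorf("standby port range %d-%d is outside 0-65535", standbyStart, standbyEnd)
	}
	if standbyStart >= standbyEnd {
		return fmt.Errorf("standby port range start %d must be less than end %d", standbyStart, standbyEnd)
	}
	if standbyStart <= primaryEnd && primaryStart <= standbyEnd {
		return fmt.Errorf("standby port range %d-%d overlaps primary range %d-%d",
			standbyStart, standbyEnd, primaryStart, primaryEnd)
	}
	return nil
}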
k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (1)

33-51: LGTM: Complete interface definition for NodeDataEngineUpgrade CRD

The interfaces follow Kubernetes client-gen patterns and include all necessary operations for managing NodeDataEngineUpgrade resources, including status updates.

The presence of UpdateStatus indicates this CRD has a status subresource, which is a good practice for resources that need to track their state. Ensure the NodeDataEngineUpgrade controller properly updates the status to reflect the progress of data engine upgrades.

k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go (1)

Line range hint 1-1: Reminder: This is a generated file.

This file is generated by informer-gen. Any changes should be made to the source templates/definitions rather than directly modifying this file.

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go (1)

1-195: Consider upgrade orchestration and failure handling

While the client implementation is solid, consider the following architectural aspects for the broader feature:

  1. Ensure proper orchestration of upgrades across multiple nodes to prevent cluster-wide disruptions
  2. Implement proper failure handling and rollback mechanisms
  3. Consider adding metrics for monitoring upgrade progress and status
  4. Document the upgrade workflow and potential failure scenarios

Would you like assistance in designing these architectural improvements?

controller/node_upgrade_controller.go (4)

57-60: Track TODO comment as a technical debt item

The TODO comment indicates a pending task to remove the event sink wrapper. This should be tracked in the issue tracker to ensure it's not forgotten.

Would you like me to create a GitHub issue to track this technical debt item?


154-158: Add documentation for ownership determination logic

The isResponsibleFor method uses a shared helper function, but its behavior and implications aren't immediately clear. Consider adding documentation that explains:

  • The purpose of preferredOwnerID
  • The ownership transfer conditions
  • The relationship with the shared isControllerResponsibleFor helper
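A possible doc comment along these lines (wording is a suggestion, not taken from the code):

// isResponsibleFor returns true when this controller instance should reconcile
// the given NodeDataEngineUpgrade. preferredOwnerID is the node that currently
// owns the resource; ownership transfers to this controller's node when the
// preferred owner is no longer available, following the shared
// isControllerResponsibleFor helper used by the other Longhorn controllers.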

267-273: Add validation for volume status updates

When copying volume statuses, consider adding validation to ensure that:

  • Volume names (keys) are valid
  • Status states are valid enum values
  • Messages are not excessively long

This will help prevent invalid states from being persisted.
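An illustrative validation helper; the field names and the use of k8s.io/apimachinery/pkg/util/validation are assumptions and should be matched to the actual types:

const maxStatusMessageLength = 1024 // arbitrary cap, pick a sensible limit

func validateVolumeStatuses(statuses map[string]*longhorn.VolumeUpgradeStatus) error {
	for volumeName, status := range statuses {
		// Volume names should be valid Kubernetes object names.
		if errs := validation.IsDNS1123Subdomain(volumeName); len(errs) > 0 {
			return fmt.Errorf("invalid volume name %q: %v", volumeName, errs)
		}
		if status == nil {
			return fmt.Errorf("missing status for volume %q", volumeName)
		}
		// Checking status.State against the known upgrade state constants is
		// omitted here for brevity.
		if len(status.Message) > maxStatusMessageLength {
			return fmt.Errorf("status message for volume %q exceeds %d characters", volumeName, maxStatusMessageLength)
		}
	}
	return nil
}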


29-45: Consider adding monitoring and circuit breaker mechanisms

To improve the robustness of the upgrade process, consider:

  1. Adding Prometheus metrics to track:

    • Upgrade success/failure rates
    • Duration of upgrades
    • Number of retries
    • Current upgrade state distributions
  2. Implementing a circuit breaker to:

    • Prevent cascading failures
    • Auto-pause upgrades if failure rate exceeds threshold
    • Allow manual intervention when needed
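A sketch of the suggested metrics using prometheus/client_golang; the metric and label names are placeholders:

var (
	upgradeResultTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "longhorn_node_data_engine_upgrade_result_total",
			Help: "Count of node data engine upgrades by result.",
		},
		[]string{"node", "result"}, // result: "success" or "failure"
	)
	upgradeDurationSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "longhorn_node_data_engine_upgrade_duration_seconds",
			Help:    "Duration of node data engine upgrades.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"node"},
	)
)

func init() {
	prometheus.MustRegister(upgradeResultTotal, upgradeDurationSeconds)
}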
controller/monitor/node_upgrade_monitor.go (3)

24-24: Add a comment explaining the sync period choice.

Consider adding a comment explaining why 3 seconds was chosen as the sync period, to help future maintainers understand the rationale.


41-69: Add validation for required parameters in constructor.

The constructor should validate that required parameters (logger, ds, nodeUpgradeName, nodeID, syncCallback) are not nil/empty before proceeding.

 func NewNodeDataEngineUpgradeMonitor(logger logrus.FieldLogger, ds *datastore.DataStore, nodeUpgradeName, nodeID string, syncCallback func(key string)) (*NodeDataEngineUpgradeMonitor, error) {
+	if logger == nil {
+		return nil, errors.New("logger is required")
+	}
+	if ds == nil {
+		return nil, errors.New("datastore is required")
+	}
+	if nodeUpgradeName == "" {
+		return nil, errors.New("nodeUpgradeName is required")
+	}
+	if nodeID == "" {
+		return nil, errors.New("nodeID is required")
+	}
+	if syncCallback == nil {
+		return nil, errors.New("syncCallback is required")
+	}
+
 	nodeUpgrade, err := ds.GetNodeDataEngineUpgradeRO(nodeUpgradeName)

108-112: Enhance error logging with additional context.

When logging errors, include relevant fields to help with debugging.

 	nodeUpgrade, err := m.ds.GetNodeDataEngineUpgrade(m.nodeUpgradeName)
 	if err != nil {
-		return errors.Wrapf(err, "failed to get longhorn nodeDataEngineUpgrade %v", m.nodeUpgradeName)
+		return errors.Wrapf(err, "failed to get longhorn nodeDataEngineUpgrade %v, ownerID: %v", 
+			m.nodeUpgradeName, m.nodeUpgradeStatus.OwnerID)
 	}
engineapi/instance_manager.go (2)

532-555: Enhance error messages with more context

The error messages could be more descriptive by including the actual invalid address that caused the error.

Consider applying this diff:

-		return nil, errors.New("invalid initiator address format")
+		return nil, fmt.Errorf("invalid initiator address format: %s", initiatorAddress)

-		return nil, errors.New("invalid target address format")
+		return nil, fmt.Errorf("invalid target address format: %s", targetAddress)

-			return nil, errors.New("invalid replica address format")
+			return nil, fmt.Errorf("invalid replica address format: %s", addr)

864-951: LGTM! Consider adding method documentation.

The new request types and methods are well-implemented with proper validation and error handling. However, it would be helpful to add documentation comments explaining the purpose and behavior of each method, especially regarding:

  • When these methods should be called in the upgrade process
  • The expected state of the engine before and after each operation
  • Any side effects or important considerations

Example documentation:

// EngineInstanceSuspendRequest contains the parameters for suspending an engine instance
type EngineInstanceSuspendRequest struct {
    // Engine is the engine instance to be suspended
    Engine *longhorn.Engine
}

// EngineInstanceSuspend suspends an engine instance, temporarily halting I/O operations.
// This is typically called during the live upgrade process to prepare for target switchover.
// The engine must be in a running state before suspension.
func (c *InstanceManagerClient) EngineInstanceSuspend(req *EngineInstanceSuspendRequest) error {
controller/instance_handler.go (3)

58-165: Consider breaking down the syncStatusIPsAndPorts function

The function is handling multiple responsibilities including IP synchronization, port management, and target instance status updates. Consider breaking it down into smaller, focused functions for better maintainability.

Suggested refactoring:

-func (h *InstanceHandler) syncStatusIPsAndPorts(im *longhorn.InstanceManager...) {
+func (h *InstanceHandler) syncStatusIPsAndPorts(im *longhorn.InstanceManager...) {
+    h.syncInstanceIPs(im, status)
+    h.syncInstancePorts(instance, status)
+    
+    if !h.instanceManagerHandler.IsEngine(obj) {
+        return
+    }
+    
+    if types.IsDataEngineV2(spec.DataEngine) && spec.TargetNodeID != "" {
+        h.syncTargetInstanceStatus(spec, status, instanceName)
+    } else {
+        h.syncLocalTargetStatus(im, status)
+    }
+}

716-795: Reduce code duplication in createInstance

The function has duplicate error handling and event recording logic between V1 and V2 paths.

Consider extracting common error handling:

+func (h *InstanceHandler) handleInstanceCreationError(instanceName string, obj runtime.Object, err error) error {
+    if !types.ErrorAlreadyExists(err) {
+        h.eventRecorder.Eventf(obj, corev1.EventTypeWarning, constant.EventReasonFailedStarting, 
+            "Error starting %v: %v", instanceName, err)
+        return err
+    }
+    return nil
+}

 func (h *InstanceHandler) createInstance(instanceName string, dataEngine longhorn.DataEngineType, obj runtime.Object) (err error) {
     // ... existing code ...
-    if !types.ErrorAlreadyExists(err) {
-        h.eventRecorder.Eventf(obj, corev1.EventTypeWarning, constant.EventReasonFailedStarting, 
-            "Error starting %v: %v", instanceName, err)
-        return err
-    }
+    if err := h.handleInstanceCreationError(instanceName, obj, err); err != nil {
+        return err
+    }

955-983: Enhance logging in isTargetInstanceReplacementRunning

While the implementation is solid, adding more detailed logging would help with debugging and monitoring the target instance replacement process.

Consider adding more detailed logs:

 func (h *InstanceHandler) isTargetInstanceReplacementRunning(instanceName string, spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) (bool, error) {
     if spec.TargetNodeID == "" {
+        logrus.Debugf("No target node specified for instance %v", instanceName)
         return false, nil
     }
 
     logrus.Infof("Checking whether instance %v is running on target node %v", instanceName, spec.TargetNodeID)
+    
+    if isVolumeBeingSwitchedBack(spec, status) {
+        logrus.Debugf("Volume is being switched back for instance %v", instanceName)
+    }
types/types.go (1)

1271-1273: Add input validation for prefix and nodeID parameters.

Consider adding validation to ensure that prefix and nodeID are not empty strings before generating the name.

 func GenerateNodeDataEngineUpgradeName(prefix, nodeID string) string {
+	if prefix == "" || nodeID == "" {
+		return ""
+	}
 	return prefix + "-" + nodeID + "-" + util.RandomID()
 }
controller/engine_controller.go (1)

Line range hint 2419-2476: Consider enhancing logging for better observability.

The v2 data engine upgrade implementation is solid with comprehensive checks. However, consider adding debug logs for state transitions to aid in troubleshooting.

Add debug logs before key state transitions:

 if types.IsDataEngineV2(e.Spec.DataEngine) {
+    log.Debug("Starting v2 data engine upgrade checks")
     // Check if the initiator instance is running
     im, err := ec.ds.GetRunningInstanceManagerByNodeRO(e.Spec.NodeID, longhorn.DataEngineTypeV2)
     if err != nil {
         return err
     }
+    log.Debugf("Found initiator instance manager %v in state %v", im.Name, im.Status.CurrentState)
k8s/crds.yaml (2)

2582-2584: Enhance field description for dataEngineUpgradeRequested

The description could be more detailed to better explain the purpose and implications of this field.

Apply this diff to improve the description:

              dataEngineUpgradeRequested:
-               description: Request to upgrade the instance manager for v2 volumes on the node.
+               description: Indicates whether an upgrade to v2 data engine has been requested for this node. When set to true, the system will initiate the upgrade process for all eligible volumes on this node using the instance manager.
                type: boolean

Based on the verification results, I can now provide the final response:

Inconsistency found in data engine enum values across CRDs

The data engine enums are inconsistent across different CRDs:

  • DataEngineUpgradeManager CRD only allows v2 value
  • Other CRDs (like Engine) allow both v1 and v2 values
  • Some CRDs have the data engine field without enum restrictions

This inconsistency could lead to validation issues during upgrades.

🔗 Analysis chain

Line range hint 1313-3316: Verify CRD versioning consistency

Let's verify that all upgrade-related CRDs use consistent API versions and naming conventions.


Let me gather more information to verify the version consistency thoroughly.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check version consistency across upgrade-related CRDs
# Expected results: All CRDs should use v1beta2 version

echo "Checking CRD versions..."
for crd in "dataengineupgrademanagers" "nodedataengineupgrades"; do
  echo "Checking $crd..."
  grep -A 5 "name: $crd" k8s/crds.yaml | grep "name: v1beta"
done

echo "Checking naming conventions..."
grep -E "kind: (DataEngineUpgradeManager|NodeDataEngineUpgrade)" k8s/crds.yaml

Length of output: 734


Script:

#!/bin/bash
# Check version fields in the upgrade-related CRDs more comprehensively

echo "Checking storage version for DataEngineUpgradeManager..."
grep -A 30 "versions:" k8s/crds.yaml | grep -B 5 -A 5 "storage: true" | grep -B 5 "name: v1beta"

echo -e "\nChecking version fields in spec for both CRDs..."
grep -A 2 "apiVersion:" k8s/crds.yaml | grep -B 2 "kind: CustomResourceDefinition"

echo -e "\nChecking data engine enum values consistency..."
grep -A 5 "dataEngine:" k8s/crds.yaml

Length of output: 3990

controller/volume_controller.go (5)

1007-1012: Improve v2 data engine replica cleanup logic

The v2 data engine replica cleanup logic needs to be more explicit about why the replica image can be different from the volume image.

-    // For a v2 volume, the instance manager image of a replica can be different from the one of its volume
+    // For a v2 volume, the replica image is determined by the instance manager image rather than the volume image,
+    // so we skip the image check during cleanup
     if types.IsDataEngineV1(v.Spec.DataEngine) {
         // r.Spec.Active shouldn't be set for the leftover replicas, something must wrong
         log.WithField("replica", r.Name).Warnf("Replica engine image %v is different from volume engine image %v, "+

1827-1836: Improve replica state handling for v2 data engine

The v2 data engine replica state handling is correct but could use better comments to explain the logic.

     if r.Spec.FailedAt == "" {
         if r.Status.CurrentState == longhorn.InstanceStateStopped {
             if types.IsDataEngineV1(e.Spec.DataEngine) {
                 if r.Spec.Image == v.Status.CurrentImage {
                     r.Spec.DesireState = longhorn.InstanceStateRunning
                 }
             } else {
+                // For v2 data engine, replica image is managed by instance manager
+                // so we can start the replica regardless of the volume image
                 r.Spec.DesireState = longhorn.InstanceStateRunning
             }
         }

3817-3823: Improve v2 replica image handling documentation

The v2 replica image handling logic is correct but needs better documentation.

     image := v.Status.CurrentImage
     if types.IsDataEngineV2(v.Spec.DataEngine) {
-        // spec.image of v2 replica can be empty and different from the volume image,
-        // because the image of a v2 replica is the same as the running instance manager.
+        // For v2 data engine:
+        // 1. Replica image is determined by the instance manager running it
+        // 2. We set spec.image to empty to allow instance manager to inject its image
+        // 3. This differs from v1 where replica image must match volume image
         image = ""
     }

5086-5094: Improve isVolumeUpgrading function for v2 data engine

The function correctly handles both v1 and v2 upgrade checks but could use better documentation.

 func isVolumeUpgrading(v *longhorn.Volume) bool {
     imageNotUpdated := v.Status.CurrentImage != v.Spec.Image
 
+    // v1: Only check image version
     if types.IsDataEngineV1(v.Spec.DataEngine) {
         return imageNotUpdated
     }
 
+    // v2: Check both image version and target node changes
+    // This is required because v2 upgrades involve both image updates
+    // and potential target node migrations
     return imageNotUpdated || v.Spec.TargetNodeID != v.Status.CurrentTargetNodeID
 }

1619-1619: Improve error logging for volume dependent resources

The warning message could be more descriptive about the implications.

-    log.WithField("e.Status.CurrentState", e.Status.CurrentState).Warn("Volume is attached but dependent resources are not opened")
+    log.WithField("e.Status.CurrentState", e.Status.CurrentState).Warn("Volume is attached but dependent resources (engine/replicas) are not in expected state, this may impact volume operations")
datastore/longhorn.go (3)

Line range hint 3742-3782: Refactoring improves code organization but needs validation

The refactoring of GetInstanceManagerByInstance() improves readability by extracting logic into helper functions. However, the nodeID assignment for engines needs careful validation to ensure it handles all edge cases correctly.

Consider adding validation to ensure nodeID is never empty when isInstanceOnRemoteNode is true:

 if isInstanceOnRemoteNode {
+    if obj.Spec.TargetNodeID == "" {
+        return nil, fmt.Errorf("invalid request: no target node ID specified for remote instance %v", name) 
+    }
     nodeID = obj.Spec.TargetNodeID
 }

5765-5876: Node data engine upgrade implementation needs error handling enhancement

The NodeDataEngineUpgrade CRUD operations are well-structured but could benefit from additional error handling.

Consider adding validation in CreateNodeDataEngineUpgrade:

func (s *DataStore) CreateNodeDataEngineUpgrade(nodeUpgrade *longhorn.NodeDataEngineUpgrade) (*longhorn.NodeDataEngineUpgrade, error) {
+    if nodeUpgrade.Spec.NodeName == "" {
+        return nil, fmt.Errorf("invalid NodeDataEngineUpgrade: node name is required")
+    }
     ret, err := s.lhClient.LonghornV1beta2().NodeDataEngineUpgrades(s.namespace).Create(...)

5998-6005: Node upgrade check helper needs additional validation

The IsNodeDataEngineUpgradeRequested helper is concise but could benefit from additional validation.

Consider adding validation for node existence:

func (s *DataStore) IsNodeDataEngineUpgradeRequested(name string) (bool, error) {
+    if name == "" {
+        return false, fmt.Errorf("node name is required")
+    }
     node, err := s.GetNodeRO(name)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ce1c3db and eda6a7b.

📒 Files selected for processing (50)
  • controller/backup_controller.go (1 hunks)
  • controller/controller_manager.go (2 hunks)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
  • controller/monitor/upgrade_manager_monitor.go (1 hunks)
  • controller/node_controller.go (2 hunks)
  • controller/node_upgrade_controller.go (1 hunks)
  • controller/replica_controller.go (5 hunks)
  • controller/uninstall_controller.go (4 hunks)
  • controller/upgrade_manager_controller.go (1 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
  • datastore/datastore.go (3 hunks)
  • datastore/longhorn.go (6 hunks)
  • engineapi/instance_manager.go (5 hunks)
  • engineapi/instance_manager_test.go (1 hunks)
  • k8s/crds.yaml (84 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (6 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/node.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/register.go (1 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/volume.go (2 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (4 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go (2 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go (3 hunks)
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/generic.go (2 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go (4 hunks)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (1 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go (2 hunks)
  • k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go (1 hunks)
  • scheduler/replica_scheduler.go (1 hunks)
  • types/types.go (4 hunks)
  • webhook/resources/dataengineupgrademanager/mutator.go (1 hunks)
  • webhook/resources/dataengineupgrademanager/validator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/mutator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/validator.go (1 hunks)
  • webhook/resources/volume/validator.go (5 hunks)
  • webhook/server/mutation.go (2 hunks)
  • webhook/server/validation.go (2 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
✅ Files skipped from review due to trivial changes (2)
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/nodedataengineupgrade.go
  • k8s/pkg/client/listers/longhorn/v1beta2/nodedataengineupgrade.go
🚧 Files skipped from review as they are similar to previous changes (27)
  • controller/controller_manager.go
  • controller/instance_handler_test.go
  • controller/node_controller.go
  • controller/replica_controller.go
  • controller/uninstall_controller.go
  • controller/volume_controller_test.go
  • datastore/datastore.go
  • engineapi/instance_manager_test.go
  • k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go
  • k8s/pkg/apis/longhorn/v1beta2/node.go
  • k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go
  • k8s/pkg/apis/longhorn/v1beta2/register.go
  • k8s/pkg/apis/longhorn/v1beta2/volume.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_dataengineupgrademanager.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_longhorn_client.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/fake/fake_nodedataengineupgrade.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/generated_expansion.go
  • k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/longhorn_client.go
  • k8s/pkg/client/informers/externalversions/longhorn/v1beta2/dataengineupgrademanager.go
  • k8s/pkg/client/listers/longhorn/v1beta2/expansion_generated.go
  • scheduler/replica_scheduler.go
  • webhook/resources/dataengineupgrademanager/mutator.go
  • webhook/resources/dataengineupgrademanager/validator.go
  • webhook/resources/nodedataengineupgrade/mutator.go
  • webhook/resources/volume/validator.go
  • webhook/server/mutation.go
  • webhook/server/validation.go
🧰 Additional context used
📓 Learnings (7)
controller/engine_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/engine_controller.go:524-527
Timestamp: 2024-11-25T12:39:58.926Z
Learning: In `controller/engine_controller.go`, `e.Status.Port` is sourced from the SPDK engine and does not require additional validation.
controller/instance_handler.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/instance_handler.go:919-939
Timestamp: 2024-11-25T23:56:53.252Z
Learning: In the Longhorn Manager Go code, the function `engineapi.NewInstanceManagerClient` does not accept a `context.Context` parameter. Therefore, we cannot pass `ctx` to it in functions like `isInstanceExist` in `controller/instance_handler.go`.
controller/monitor/node_upgrade_monitor.go (2)
Learnt from: james-munson
PR: longhorn/longhorn-manager#3211
File: app/post_upgrade.go:102-113
Timestamp: 2024-11-10T16:45:04.898Z
Learning: In Go, when a deferred function references a variable like `err`, ensure that the variable is declared in the outer scope and not within an inner scope (such as within `if err := ...`), to prevent compilation errors and unintended variable shadowing.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/monitor/node_upgrade_monitor.go:351-357
Timestamp: 2024-11-25T23:55:02.080Z
Learning: `GetVolumeRO` guarantees that `volume` is non-nil when `err == nil`, so explicit nil checks after error handling are not needed.
controller/monitor/upgrade_manager_monitor.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/monitor/upgrade_manager_monitor.go:237-254
Timestamp: 2024-11-26T00:12:16.791Z
Learning: In the `handleUpgradeStateInitializing` function of `upgrade_manager_monitor.go`, when upgrading instance managers, we aim to upgrade old v2 instance managers rather than v1. Therefore, we should list instance managers with `longhorn.DataEngineTypeV2`.
controller/node_upgrade_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/node_upgrade_controller.go:135-139
Timestamp: 2024-11-26T00:03:56.311Z
Learning: In the Longhorn project, the constant `maxRetries` is defined in `base_controller.go` and should not be redefined in individual controller files.
controller/upgrade_manager_controller.go (2)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/upgrade_manager_controller.go:135-139
Timestamp: 2024-11-25T15:30:18.858Z
Learning: In the Longhorn Manager codebase, the variable `maxRetries` is defined in `base_controller.go`, and other controllers can use it without redefining it.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/node_upgrade_controller.go:135-139
Timestamp: 2024-11-26T00:03:56.311Z
Learning: In the Longhorn project, the constant `maxRetries` is defined in `base_controller.go` and should not be redefined in individual controller files.
webhook/resources/nodedataengineupgrade/validator.go (3)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: webhook/resources/dataengineupgrademanager/validator.go:41-53
Timestamp: 2024-11-26T00:01:13.256Z
Learning: In `webhook/resources/dataengineupgrademanager/validator.go` of the Longhorn project, when validating `DataEngineUpgradeManager` creation, the `upgradeManager.Spec.Nodes` field can be empty, indicating that all available nodes should be upgraded.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: webhook/resources/nodedataengineupgrade/validator.go:46-48
Timestamp: 2024-11-26T00:07:32.856Z
Learning: In `webhook/resources/nodedataengineupgrade/validator.go`, it's unnecessary to check if `NodeDataEngineUpgrade.Spec` is nil before accessing its fields, as it is guaranteed to be non-nil.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: webhook/resources/nodedataengineupgrade/validator.go:78-93
Timestamp: 2024-11-25T23:56:04.203Z
Learning: In the `Update` method of `nodeDataEngineUpgradeValidator` in `validator.go`, `Spec` is a struct and cannot be nil, so nil checks before accessing `Spec` fields are unnecessary.
🪛 golangci-lint (1.62.2)
controller/volume_controller.go

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3221-3221: controller/volume_controller.go#L3221
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (38)
webhook/resources/nodedataengineupgrade/validator.go (1)

17-95: Validation logic for NodeDataEngineUpgrade is well-implemented

The validator correctly handles creation and update operations, ensuring required fields are present and immutable fields are enforced. Good job on following the correct validation patterns.

k8s/pkg/client/listers/longhorn/v1beta2/dataengineupgrademanager.go (2)

1-26: LGTM: Standard Kubernetes file header and imports

The file header, license, and imports follow the standard Kubernetes conventions.


84-94: LGTM: Proper error handling in Get method

The Get method properly handles errors and uses the correct Kubernetes error types for not-found cases.

k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (1)

61-67: LGTM: New instance state for upgrade support

The addition of InstanceStateSuspended state aligns with the live upgrade feature requirements.

Let's verify the state handling in the codebase:

✅ Verification successful

The new suspended state is properly integrated and handled in the codebase

The verification shows that InstanceStateSuspended is well-integrated:

  • Used in engine controller for state transitions during live upgrades
  • Properly handled in volume controller for suspending/resuming engines
  • Integrated in instance handler with appropriate state management and validation
  • Specifically restricted to v2 data engine instances
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for handlers of the new suspended state
rg "InstanceStateSuspended" --type go -B 2 -A 2

Length of output: 3876

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/nodedataengineupgrade.go (3)

1-31: LGTM: Standard k8s client-gen setup

The file header, license, and imports follow the standard Kubernetes client-gen patterns.


53-65: LGTM: Standard client implementation structure

The implementation struct and constructor follow the standard Kubernetes client patterns with proper namespace scoping.


67-195: LGTM: Complete CRUD implementation with proper error handling

The implementation includes all necessary CRUD operations with proper context handling, timeouts, and error management. Each method follows standard Kubernetes client patterns.

Let's verify the resource definition matches the client implementation:

✅ Verification successful

LGTM: Client implementation correctly matches CRD and API type definitions

The client implementation is verified to be correct with:

  • API type properly defines Status field in the struct
  • Status subresource is enabled in the CRD (confirmed by subresources presence)
  • All CRUD operations including UpdateStatus are properly implemented
  • Proper error handling and context propagation is in place
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify CRD definition matches client implementation
# Check for CRD definition and status subresource
rg -A 5 "kind: CustomResourceDefinition.*NodeDataEngineUpgrade" 

Length of output: 65


Script:

#!/bin/bash
# Let's try a broader search for CRD definitions and NodeDataEngineUpgrade
# First, search for any yaml/json files containing NodeDataEngineUpgrade
fd -e yaml -e yml -e json | xargs rg "NodeDataEngineUpgrade" -l

# Then, let's look for the API type definition
ast-grep --pattern 'type NodeDataEngineUpgrade struct {
  $$$
}'

# Also search for the resource registration
rg "nodedataengineupgrades" -A 5 -B 5

Length of output: 31082


Script:

#!/bin/bash
# Let's check the CRD definition in crds.yaml for status subresource and schema
rg -A 20 "kind: CustomResourceDefinition.*NodeDataEngineUpgrade" k8s/crds.yaml

# Also verify the API type structure
ast-grep --pattern 'type NodeDataEngineUpgradeStatus struct {
  $$$
}'

Length of output: 992


Script:

#!/bin/bash
# Let's check the CRD definition specifically for status subresource
rg -A 30 "name: nodedataengineupgrades.longhorn.io" k8s/crds.yaml

# Also check for any webhook configurations
rg -A 10 "NodeDataEngineUpgrade.*webhook" 

Length of output: 1010


Script:

#!/bin/bash
# Let's check if the CRD has status subresource enabled
rg -A 50 "name: v1beta2" k8s/crds.yaml | rg "subresources|served|storage"

# Also check the complete type definition
ast-grep --pattern 'type NodeDataEngineUpgrade struct {
  $$$
}'

Length of output: 2473

k8s/pkg/client/informers/externalversions/longhorn/v1beta2/interface.go (2)

41-42: LGTM! Interface additions follow consistent patterns.

The new informer methods DataEngineUpgradeManagers() and NodeDataEngineUpgrades() are correctly added to the Interface type, following the established naming conventions and return types.

Also applies to: 51-52


123-126: LGTM! Implementation follows standard informer patterns.

The implementations of both informer methods follow the consistent pattern used throughout the file:

  • Proper initialization of informer structs
  • Consistent usage of factory, namespace, and tweakListOptions

Let's verify the existence of the corresponding informer types:

Also applies to: 148-151

✅ Verification successful

LGTM! The informer types are properly defined and follow the standard pattern

The verification confirms that both informer types are correctly defined in their respective files:

  • dataEngineUpgradeManagerInformer in dataengineupgrademanager.go
  • nodeDataEngineUpgradeInformer in nodedataengineupgrade.go

This matches the standard informer pattern where each type has its dedicated file and proper struct definition.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the corresponding informer types exist
# Expected: Find declarations of dataEngineUpgradeManagerInformer and nodeDataEngineUpgradeInformer types

rg -g '*.go' 'type (dataEngineUpgradeManagerInformer|nodeDataEngineUpgradeInformer) struct'

Length of output: 355

k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go (3)

1-31: LGTM: Package structure and imports are well-organized

The file header, license, and imports are properly structured following Kubernetes code generation standards.


53-65: LGTM: Clean implementation of client struct and constructor

The implementation follows standard Kubernetes client patterns with proper encapsulation of the REST client and namespace.


67-195: LGTM with verification: Comprehensive CRUD implementation with proper timeout handling

The implementation follows Kubernetes client patterns with proper timeout handling, error handling, and subresource management. Let's verify the timeout handling implementation matches other core Kubernetes clients.

✅ Verification successful

Based on the search results and previous outputs, I can now provide a final response since we have verified:

  1. The List method implementation with proper timeout handling across multiple resources
  2. The Watch method implementation with consistent timeout handling
  3. The DeleteCollection method implementation with proper timeout handling

Verified: Timeout handling implementation is consistent with Kubernetes client patterns

The timeout handling in DataEngineUpgradeManager's CRUD operations follows the standard Kubernetes client patterns:

  • List/Watch operations properly extract timeout from ListOptions
  • DeleteCollection correctly handles timeout from ListOptions
  • All operations consistently use time.Duration for timeout conversion
  • Timeout is properly applied to the request chain via client.Get()/Delete().Timeout()
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Compare timeout handling with other Kubernetes client implementations
# Expected: Similar timeout handling patterns in List, Watch, and DeleteCollection operations

# Search for similar timeout handling patterns in other client implementations
rg -A 3 "var timeout time.Duration.*if opts.TimeoutSeconds"

Length of output: 61


Script:

#!/bin/bash
# Let's try a different approach to find timeout handling patterns
# First, let's look for List methods with timeout handling
ast-grep --pattern 'func $_($_, $_, opts v1.ListOptions) $_ {
  var timeout time.Duration
  if opts.TimeoutSeconds != nil {
    $$$
  }
  $$$
}'

# Also search for similar patterns in Watch and DeleteCollection
rg -U "func.*Watch.*Context.*ListOptions.*\{(\n.*?)*?timeout.*:=.*opts\.TimeoutSeconds.*\*.*Second" 

# And check for any other timeout handling patterns
rg -U "Timeout\(timeout\).*Do\(ctx\)"

Length of output: 288


Script:

#!/bin/bash
# Let's try a simpler approach to find timeout handling in Kubernetes clients
# Search for List method implementations
rg -A 10 "func.*List.*Context.*ListOptions.*\{" k8s/pkg/client/

# Search for Watch method implementations
rg -A 10 "func.*Watch.*Context.*ListOptions.*\{" k8s/pkg/client/

# Search for DeleteCollection method implementations
rg -A 10 "func.*DeleteCollection.*Context.*DeleteOptions.*ListOptions.*\{" k8s/pkg/client/

Length of output: 116493


Script:

#!/bin/bash
# Let's check for DeleteCollection implementations
rg -A 5 "func.*DeleteCollection.*Context.*DeleteOptions.*ListOptions.*\{" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/dataengineupgrademanager.go

# And also check for DeleteCollection timeout handling in other files
rg -A 5 "DeleteCollection.*timeout = time.Duration" k8s/pkg/client/clientset/versioned/typed/longhorn/v1beta2/

Length of output: 564

k8s/pkg/client/informers/externalversions/generic.go (2)

Line range hint 1-1: ⚠️ This is a generated file

This file is auto-generated by informer-gen. Any direct modifications will be overwritten. Ensure that changes are made to the appropriate generation sources instead.

Let's verify the generation sources:

✅ Verification successful

Let me verify the CRD definitions for the new resources.


✅ Generated code is properly sourced from type definitions

The changes in the generated informer code are correctly sourced from properly defined types:

  • Both new resources have the required generation markers (+genclient, +k8s:deepcopy-gen)
  • Types are defined in k8s/pkg/apis/longhorn/v1beta2/ with proper kubebuilder annotations
  • CRDs exist in k8s/crds.yaml
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of resource definitions and generation markers

# Check for CRD definitions of the new resources
fd -e yaml -e yml . | xargs grep -l "kind: CustomResourceDefinition" | xargs grep -l "dataengineupgrademanagers\|nodedataengineupgrades"

# Check for generation tags/markers in Go types
fd -e go . | xargs grep -l "// +genclient\|// +k8s:deepcopy-gen"

Length of output: 12249


Script:

#!/bin/bash
# Check the CRD definitions in k8s/crds.yaml for the new resources
cat k8s/crds.yaml | grep -A 5 "kind: CustomResourceDefinition" | grep -A 20 "dataengineupgrademanagers\|nodedataengineupgrades"

# Check for type definitions with generation markers
grep -A 5 "// +genclient" k8s/pkg/apis/longhorn/v1beta2/dataengineupgrademanager.go k8s/pkg/apis/longhorn/v1beta2/nodedataengineupgrade.go

Length of output: 1623


103-104: LGTM: New informer cases properly integrated

The addition of dataengineupgrademanagers and nodedataengineupgrades resources follows the established pattern and aligns with the PR's objective of implementing v2 data engine live upgrades.

Let's verify the integration of these resources:

Also applies to: 113-114

✅ Verification successful

Let me gather more information to verify the complete integration by checking for the controller implementations and related types.


Let me check one final aspect - the types registration and validation rules to ensure complete integration.


Integration verification complete - resources properly integrated

The verification confirms proper integration of DataEngineUpgradeManager and NodeDataEngineUpgrade resources:

  • Types are defined with complete struct definitions including specs and status
  • Resources are registered in types.go with appropriate kinds and labels
  • Informers and listers are properly implemented
  • CRD definitions exist in k8s/crds.yaml
  • Helper functions for name generation and label management are implemented
  • Scheme registration is in place
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the complete integration of new resources

# Check for corresponding controller implementations
ast-grep --pattern 'type $_(Controller|Validator) struct {
  $$$
}'

# Check for informer/lister implementations
fd -e go . | xargs grep -l "DataEngineUpgradeManager\|NodeDataEngineUpgrade" | grep -i "informer\|lister"

Length of output: 669


Script:

#!/bin/bash
# Check for controller and type definitions
ast-grep --pattern 'type DataEngineUpgradeManager struct {
  $$$
}'

ast-grep --pattern 'type NodeDataEngineUpgrade struct {
  $$$
}'

# Check for controller implementations
rg "type.*Controller.*struct.*DataEngineUpgradeManager" -A 5
rg "type.*Controller.*struct.*NodeDataEngineUpgrade" -A 5

# Check for CRD types and validation
fd -e go . | xargs grep -l "DataEngineUpgradeManager\|NodeDataEngineUpgrade" | grep -i "types\|validation"

Length of output: 1713


Script:

#!/bin/bash
# Check types registration and validation
rg "DataEngineUpgradeManager|NodeDataEngineUpgrade" types/types.go -A 5 -B 5

# Check for scheme registration
rg "AddToScheme" k8s/pkg/apis/longhorn/v1beta2/

# Verify CRD validation rules
fd -e yaml . | xargs grep -l "DataEngineUpgradeManager\|NodeDataEngineUpgrade"

Length of output: 2411

controller/upgrade_manager_controller.go (4)

146-156: LGTM! Well-structured helper functions

The helper functions are well-implemented with proper logging and responsibility checks.


262-275: ⚠️ Potential issue

Add mutex protection to monitor operations

The monitor creation and access need to be protected by a mutex to prevent race conditions in concurrent operations.

Add mutex protection to monitor operations:

 func (uc *DataEngineUpgradeManagerController) createDataEngineUpgradeManagerMonitor(upgradeManager *longhorn.DataEngineUpgradeManager) (monitor.Monitor, error) {
+    uc.monitorLock.Lock()
+    defer uc.monitorLock.Unlock()
     if uc.dataEngineUpgradeManagerMonitor != nil {
         return uc.dataEngineUpgradeManagerMonitor, nil
     }
     // ... rest of the function
 }

Likely invalid or redundant comment.


29-45: ⚠️ Potential issue

Add mutex protection for concurrent access to shared resources

The dataEngineUpgradeManagerMonitor field is accessed from multiple goroutines without synchronization, which could lead to race conditions.

Add mutex protection:

 type DataEngineUpgradeManagerController struct {
     *baseController
+    monitorLock sync.RWMutex
     // ... other fields ...
     dataEngineUpgradeManagerMonitor monitor.Monitor
     // ... remaining fields ...
 }

Likely invalid or redundant comment.


235-238: ⚠️ Potential issue

Handle type assertion failure appropriately

The type assertion failure is logged but the function continues execution, which could lead to inconsistent state.

Return an error when the type assertion fails:

 status, ok := data.(*longhorn.DataEngineUpgradeManagerStatus)
 if !ok {
     log.Errorf("Failed to assert value from dataEngineUpgradeManager monitor: %v", data)
+    return fmt.Errorf("failed to assert monitor data to DataEngineUpgradeManagerStatus")
 }

Likely invalid or redundant comment.

controller/monitor/node_upgrade_monitor.go (2)

293-293: ⚠️ Potential issue

Resource leak: Close engineClientProxy after each iteration.

Deferring engineClientProxy.Close() inside a loop causes all closures to be postponed until the function returns, potentially leading to resource exhaustion.


174-178: ⚠️ Potential issue

Variable shadowing in deferred functions.

Multiple deferred functions in this file redefine the err variable, which can lead to confusion and potential bugs.

Also applies to: 192-206

engineapi/instance_manager.go (3)

6-6: LGTM!

The addition of the "net" package is appropriate for the new host:port parsing functionality.


283-292: LGTM!

The addition of standby target port fields enhances the instance process status to support v2 data engine live upgrades.


488-491: LGTM!

The integration of getReplicaAddresses in EngineInstanceCreate is well-implemented with proper error handling.

controller/backup_controller.go (1)

599-607: LGTM! Verify handling of ongoing backups during upgrades.

The addition of the data engine upgrade check is correct and aligns with the PR objective of enabling live upgrades. When a node has a data engine upgrade requested, the backup controller correctly relinquishes responsibility.

Let's verify if there are any ongoing backup operations that might be affected:

✅ Verification successful

Backup operations are safely handled during data engine upgrades

The code analysis reveals that the backup controller already has robust mechanisms to handle backup state transitions:

  1. The backup monitor system (BackupMonitor) properly tracks backup states and can gracefully handle interruptions:

    • Uses exponential backoff for pending states
    • Properly transitions to error states when needed
    • Has clean shutdown mechanisms via disableBackupMonitor
  2. When the controller relinquishes responsibility (returns false), any ongoing backups will:

    • Maintain their state through the backup monitor
    • Either complete normally or transition to an error state
    • Be properly cleaned up via the existing monitor shutdown mechanisms
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any ongoing backup operations that might be affected by the upgrade
# Test: Search for backups in progress. Expect: List of backups in non-final states

# Find backup resources in non-final states
ast-grep --pattern 'backup.Status.State == $state' | rg -A 2 'BackupState(New|Pending|InProgress)'

# Find any backup monitor related code that might need adjustment
rg -A 5 'BackupMonitor'

Length of output: 14784

controller/instance_handler.go (4)

38-46: LGTM: Interface changes align with live upgrade requirements

The new interface methods provide a comprehensive set of operations needed for V2 data engine live upgrades, including instance suspension, resumption, and target management.


919-939: LGTM: Implementation aligns with project conventions

The implementation correctly follows the project convention regarding NewInstanceManagerClient usage without context parameter, as noted in the learnings.


809-839: LGTM: Clean implementation of suspend/resume functionality

The implementation provides clear error handling, proper validation, and descriptive logging for instance suspension and resumption operations.


883-995: LGTM: Well-structured helper functions

The helper functions are well-named, focused, and improve code readability. They effectively encapsulate specific checks and conditions related to instance and volume states.

types/types.go (2)

43-44: LGTM: Constants follow naming conventions and align with feature requirements.

The new constants for data engine upgrade management are well-organized and follow the project's naming conventions.

Also applies to: 163-164, 193-193


1275-1291: LGTM: Label getter functions are well-implemented.

The label getter functions follow the project's patterns and correctly implement the labeling requirements for data engine upgrade management.

k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (4)

981-1092: LGTM: DataEngineUpgradeManager deepcopy implementation

The implementation correctly handles all fields including nested maps and pointers, following Kubernetes deepcopy patterns.


1783-1890: LGTM: NodeDataEngineUpgrade deepcopy implementation

The implementation correctly handles the map of VolumeUpgradeStatus pointers with proper nil checks.


2991-3006: Previous review comment about UpgradeNodeStatus implementation is incorrect

The current implementation is correct since UpgradeNodeStatus contains only primitive fields that can be safely copied with a simple assignment.


3326-3341: Previous review comment about VolumeUpgradeStatus implementation is incorrect

The current implementation is correct since VolumeUpgradeStatus contains only primitive fields that can be safely copied with a simple assignment.

controller/engine_controller.go (1)

437-467: LGTM! Well-structured instance manager resolution.

The new findInstanceManagerAndIPs method provides a clean separation of concerns for resolving instance managers and their IPs, with proper error handling and clear logic for both initiator and target instances.

k8s/crds.yaml (1)

1791-1793: LGTM: Consistent implementation of target node fields

The addition of targetNodeID fields in Engine and Replica CRDs is well-implemented with clear descriptions and consistent types.

Also applies to: 3244-3246

datastore/longhorn.go (2)

3784-3841: Well-structured helper functions with good separation of concerns

The extracted helper functions listInstanceManagers() and filterInstanceManagers() follow good practices:

  • Clear single responsibility
  • Proper error handling
  • Good documentation

5641-5763: Data engine upgrade manager implementation follows best practices

The DataEngineUpgradeManager CRUD operations are well-implemented with:

  • Proper verification of creation/updates
  • Consistent error handling
  • Appropriate finalizer management

Comment on lines +61 to +65
if err := wait.PollUntilContextCancel(m.ctx, m.syncPeriod, true, func(context.Context) (bool, error) {
if err := m.run(struct{}{}); err != nil {
m.logger.WithError(err).Error("Stopped monitoring upgrade monitor")
}
return false, nil

🛠️ Refactor suggestion

Handle Errors from m.run() Appropriately in the Polling Loop

In the Start() method, errors returned by m.run() are only logged and the polling loop continues regardless. Consider deciding whether to stop the loop or take corrective action when m.run() fails, so that persistent errors do not cascade.
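One possible adjustment, sketched below: count consecutive failures and stop the poll loop once a threshold is exceeded (maxConsecutiveFailures is a hypothetical constant):

consecutiveFailures := 0
if err := wait.PollUntilContextCancel(m.ctx, m.syncPeriod, true, func(context.Context) (bool, error) {
	if err := m.run(struct{}{}); err != nil {
		consecutiveFailures++
		m.logger.WithError(err).Errorf("Upgrade monitor sync failed (%d consecutive failures)", consecutiveFailures)
		if consecutiveFailures >= maxConsecutiveFailures {
			return false, err // returning a non-nil error stops the polling loop
		}
		return false, nil
	}
	consecutiveFailures = 0
	return false, nil
}); err != nil {
	m.logger.WithError(err).Error("Upgrade monitor stopped due to repeated failures")
}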

return
}

// TODO: Check if there is any nodeDataEngineUpgrade in progress but not tracked by m.upgradeManagerStatus.UpgradingNode

🛠️ Refactor suggestion

Implement TODO: Check for Untracked NodeDataEngineUpgrade Processes

The TODO marks a missing check for NodeDataEngineUpgrade resources that are in progress but not tracked by m.upgradeManagerStatus.UpgradingNode. Implementing it ensures every upgrade process is properly monitored and managed; a rough sketch follows below.

Would you like assistance in implementing this functionality or opening a GitHub issue to track this task?
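A rough sketch of the check; the lister, spec/status field names, and state constants are assumptions based on the surrounding code and should be aligned with the actual types:

nodeUpgrades, err := m.ds.ListNodeDataEngineUpgrades()
if err != nil {
	return err
}
for name, nodeUpgrade := range nodeUpgrades {
	// Skip upgrades that have already finished.
	if nodeUpgrade.Status.State == longhorn.UpgradeStateCompleted ||
		nodeUpgrade.Status.State == longhorn.UpgradeStateError {
		continue
	}
	if nodeUpgrade.Spec.NodeID != m.upgradeManagerStatus.UpgradingNode {
		m.logger.Warnf("Found untracked in-progress nodeDataEngineUpgrade %v on node %v", name, nodeUpgrade.Spec.NodeID)
		// e.g. adopt it as the tracked upgrade, or mark the manager as errored.
	}
}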

Comment on lines +861 to +887
func (h *InstanceHandler) isV2DataEngineBeingUpgraded(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) bool {
if !types.IsDataEngineV2(spec.DataEngine) {
return false
}

upgradeRequested, err := h.ds.IsNodeDataEngineUpgradeRequested(spec.NodeID)
if err != nil {
logrus.WithError(err).Errorf("Failed to get node %v", spec.NodeID)
return false
}

if !upgradeRequested {
return false
}

if spec.TargetNodeID == "" {
return false
}

return spec.NodeID != spec.TargetNodeID && spec.TargetNodeID == status.CurrentTargetNodeID
}

⚠️ Potential issue

Improve error handling in isV2DataEngineBeingUpgraded

The function currently logs and swallows errors when checking node upgrade status. This could mask issues and lead to incorrect upgrade decisions.

Consider propagating the error:

-func (h *InstanceHandler) isV2DataEngineBeingUpgraded(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) bool {
+func (h *InstanceHandler) isV2DataEngineBeingUpgraded(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) (bool, error) {
     if !types.IsDataEngineV2(spec.DataEngine) {
-        return false
+        return false, nil
     }
 
     upgradeRequested, err := h.ds.IsNodeDataEngineUpgradeRequested(spec.NodeID)
     if err != nil {
-        logrus.WithError(err).Errorf("Failed to get node %v", spec.NodeID)
-        return false
+        return false, fmt.Errorf("failed to check node upgrade status: %v", err)
     }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +1041 to +1081
err = m.refresh(engine)
if err != nil && apierrors.IsConflict(errors.Cause(err)) {
// Retry if the error is due to conflict
continue
}
if types.IsDataEngineV2(engine.Spec.DataEngine) && err != nil && apierrors.IsNotFound(errors.Cause(err)) {
upgrading, upgradingCheckErr := m.ds.IsNodeDataEngineUpgradeRequested(engine.Spec.NodeID)
if upgrading {
updated, updatedCheckErr := m.isInstanceManagerUpdated(engine)
if updated {
for replicaName := range engine.Status.ReplicaModeMap {
replica, replicaErr := m.ds.GetReplicaRO(replicaName)
if replicaErr != nil {
m.logger.WithError(replicaErr).Errorf("Failed to get replica %v", replicaName)
continue
}
if replica.Spec.NodeID == engine.Spec.NodeID &&
replica.Status.CurrentState != longhorn.InstanceStateError &&
replica.Status.CurrentState != longhorn.InstanceStateStopped {
m.logger.Warnf("Replica %v in state %v is still on the node %v, will retry updating status later",
replicaName, replica.Status.CurrentState, replica.Spec.NodeID)
return false
}
}
} else {
m.logger.Warnf("v2 data engine %v is being upgraded, will retry updating status later", engine.Name)
return false
}
if updatedCheckErr != nil {
upgradingCheckErr = errors.Wrapf(updatedCheckErr, "failed to check if the instance manager is updated")
}
}
if upgradingCheckErr != nil {
err = errors.Wrapf(upgradingCheckErr, "failed to check if the engine %v is being upgraded", engine.Name)
}
}
// Retry if the error is due to conflict
utilruntime.HandleError(errors.Wrapf(err, "failed to update status for engine %v", m.Name))
break

🛠️ Refactor suggestion

Add backoff mechanism and improve error handling in monitor sync.

The error handling in the monitor sync could benefit from a backoff mechanism to prevent rapid retries, and additional logging would improve observability.

Consider these improvements:

  1. Add backoff for retries:
 if types.IsDataEngineV2(engine.Spec.DataEngine) && err != nil && apierrors.IsNotFound(errors.Cause(err)) {
+    if m.backoff.IsInBackOffSinceUpdate(engine.Name, time.Now()) {
+        m.logger.Debugf("Engine %v is in backoff period, skipping sync", engine.Name)
+        return false
+    }
     upgrading, upgradingCheckErr := m.ds.IsNodeDataEngineUpgradeRequested(engine.Spec.NodeID)
     if upgrading {
         updated, updatedCheckErr := m.isInstanceManagerUpdated(engine)
         if updated {
             // ... existing code ...
         } else {
             m.logger.Warnf("v2 data engine %v is being upgraded, will retry updating status later", engine.Name)
+            m.backoff.Next(engine.Name, time.Now())
             return false
         }
     }
 }
  1. Add structured logging:
+m.logger.WithFields(logrus.Fields{
+    "engine": engine.Name,
+    "nodeID": engine.Spec.NodeID,
+}).Debug("Starting monitor sync")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +3208 to +3214
} else {
// TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
}

⚠️ Potential issue

Handle engine state error during v2 data engine live upgrade

The TODO comment indicates missing error handling for non-running engine state during v2 data engine live upgrade.

Apply this fix to handle error states:

} else {
-    // TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
+    if e.Status.CurrentState == longhorn.InstanceStateError {
+        log.Errorf("Engine entered error state during v2 data engine live upgrade")
+        return fmt.Errorf("engine in error state during live upgrade")
+    }
+    if e.Status.CurrentState != longhorn.InstanceStateRunning {
+        log.Debugf("Engine is in %v state, waiting for running state", e.Status.CurrentState)
+    }
}
🧰 Tools
🪛 golangci-lint (1.62.2)

3208-3208: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor

[notice] 3208-3208: controller/volume_controller.go#L3208
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

…equested data engine upgrade

Longhorn 9104

Signed-off-by: Derek Su <[email protected]>
…tor and target IPs

If initiatorIP is not equal to targetIP, exclude the replica on the node
where the initiator is running, because the node is going to be upgraded
and the instance-manager will be deleted.

Longhorn 9104

Signed-off-by: Derek Su <[email protected]>
Update replica controller codes due to the changes in APIs.

Longhorn 9104

Signed-off-by: Derek Su <[email protected]>
Update replica scheduler codes due to the changes in APIs.

Longhorn 9104

Signed-off-by: Derek Su <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (19)
k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (5)

95-97: Minor: Improve field documentation grammar

The documentation for TargetNodeID has a minor grammatical issue.

-// TargetNodeID is specifies the node where the volume's initiator instance should be scheduled during data engine live upgrade.
+// TargetNodeID specifies the node where the volume's initiator instance should be scheduled during data engine live upgrade.

Line range hint 116-133: Add documentation for new target-related fields

The new target-related fields lack documentation comments explaining their purpose and usage during the live upgrade process.

Consider adding documentation like:

+// TargetIP represents the IP address of the target instance during live upgrade
 // +optional
 TargetIP string `json:"targetIP"`
+// StorageTargetIP represents the storage IP address of the target instance during live upgrade
 // +optional
 StorageTargetIP string `json:"storageTargetIP"`
+// TargetPort represents the port number of the target instance during live upgrade
 // +optional
 TargetPort int `json:"targetPort"`
+// CurrentTargetNodeID represents the node ID where the target instance is currently running
 // +optional
 CurrentTargetNodeID string `json:"currentTargetNodeID"`
+// TargetInstanceReplacementCreated indicates whether a replacement target instance has been created
 // +optional
 TargetInstanceReplacementCreated bool `json:"targetInstanceReplacementCreated"`

155-157: Add documentation for standby port fields

The new standby port fields lack documentation comments explaining their purpose and usage.

Consider adding documentation like:

+// StandbyTargetPortEnd represents the end of the port range reserved for standby target operations
 // +optional
 StandbyTargetPortEnd int32 `json:"standbyTargetPortEnd"`
+// StandbyTargetPortStart represents the start of the port range reserved for standby target operations
 // +optional
 StandbyTargetPortStart int32 `json:"standbyTargetPortStart"`

168-171: Add documentation for NvmeSubsystem type

The NvmeSubsystem type and its fields lack documentation comments explaining their purpose and usage.

Consider adding documentation like:

+// NvmeSubsystem represents an NVMe subsystem configuration
 type NvmeSubsystem struct {
+    // Paths is a map of NVMe device paths indexed by their unique identifiers
     // +optional
     Paths map[string]NvmeDevicePath `json:"paths"`
 }

173-183: Enhance NvmeDevicePath type documentation and validation

  1. Add documentation for the type and its fields.
  2. Consider using an enum type for the State field to ensure valid values (a typed-string sketch follows the diff below).
+// NvmeDevicePath represents the configuration of an NVMe device path
 type NvmeDevicePath struct {
+    // Trtype specifies the NVMe transport type
     // +optional
     Trtype string `json:"trtype"`
+    // Traddr specifies the transport address
     // +optional
     Traddr string `json:"traddr"`
+    // Trsvcid specifies the transport service ID
     // +optional
     Trsvcid string `json:"trsvcid"`
+    // SrcAddr specifies the source address
     // +optional
     SrcAddr string `json:"srcAddr"`
+    // State represents the current state of the NVMe device path
     // +optional
+    // +kubebuilder:validation:Enum=unknown;active;inactive;error
     State string `json:"state"`
 }
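
If a typed string is preferred over a plain string plus an enum marker, the shape could look like the sketch below. The concrete values are an assumption: the node upgrade monitor later in this PR checks for a "live" path state, so the real enum should mirror the states actually reported by the NVMe/SPDK layer rather than a generic list.

// NvmeDevicePathState is a hypothetical typed state for an NVMe device path.
// The listed values are illustrative and must match what the data engine reports
// (for example, the upgrade monitor checks for "live").
// +kubebuilder:validation:Enum=live;connecting;deleting;unknown
type NvmeDevicePathState string

const (
	NvmeDevicePathStateLive       NvmeDevicePathState = "live"
	NvmeDevicePathStateConnecting NvmeDevicePathState = "connecting"
	NvmeDevicePathStateDeleting   NvmeDevicePathState = "deleting"
	NvmeDevicePathStateUnknown    NvmeDevicePathState = "unknown"
)

The State field in NvmeDevicePath would then be declared as NvmeDevicePathState instead of string, giving both API-server validation and compile-time safety in the controllers.
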
controller/upgrade_manager_controller.go (1)

57-60: Address TODO comment about event sink wrapper

The TODO comment indicates that the event sink wrapper should be removed when all clients have moved to use the clientset. This should be tracked and addressed in future updates.

Would you like me to create a GitHub issue to track this technical debt?

controller/node_upgrade_controller.go (2)

57-60: Address or remove the TODO comment

The TODO comment about removing the wrapper seems outdated as it refers to clients moving to use the clientset. Consider either implementing the suggested change or removing the comment if it's no longer relevant.


29-45: Consider adding architectural documentation

This controller implements a critical feature for live upgrades. Consider adding documentation that explains:

  • The upgrade process and state transitions
  • Interaction with other controllers
  • Recovery mechanisms for failed upgrades
  • Monitoring and observability aspects
controller/instance_handler_test.go (2)

75-94: Consider parameterizing mock implementations for better test coverage.

The new mock methods currently return static errors. Consider making them configurable to test different scenarios, similar to how GetInstance uses the instance name to determine behavior.

 type MockInstanceManagerHandler struct {
+    // Add fields to control mock behavior
+    shouldSuspendFail bool
+    shouldResumeFail bool
+    shouldSwitchOverFail bool
 }

 func (imh *MockInstanceManagerHandler) SuspendInstance(obj interface{}) error {
-    return fmt.Errorf("SuspendInstance is not mocked")
+    if imh.shouldSuspendFail {
+        return fmt.Errorf("SuspendInstance failed")
+    }
+    return nil
 }
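
A sketch of how a table-driven test could drive the configurable mock; the field names follow the suggestion above, and the test is illustrative only (it assumes the standard testing package rather than the suite framework used by the existing tests):

func TestMockSuspendInstanceBehavior(t *testing.T) {
	testCases := map[string]struct {
		handler     *MockInstanceManagerHandler
		expectError bool
	}{
		"suspend succeeds": {handler: &MockInstanceManagerHandler{}, expectError: false},
		"suspend fails":    {handler: &MockInstanceManagerHandler{shouldSuspendFail: true}, expectError: true},
	}

	for name, tc := range testCases {
		err := tc.handler.SuspendInstance(nil)
		if tc.expectError && err == nil {
			t.Errorf("%s: expected an error but got none", name)
		}
		if !tc.expectError && err != nil {
			t.Errorf("%s: unexpected error: %v", name, err)
		}
	}
}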

Line range hint 147-147: Add test cases for new functionality.

Consider adding test cases to TestReconcileInstanceState for:

  1. Remote target instance scenarios using the new isInstanceOnRemoteNode parameter
  2. Target IP/Port configuration validation
  3. Instance suspension and resumption state transitions
engineapi/instance_manager.go (2)

546-569: Enhance error messages with more context.

While the address validation and filtering logic is correct, the error messages could be more specific to help with debugging.

Consider applying this diff to improve error messages:

-		return nil, errors.New("invalid initiator address format")
+		return nil, fmt.Errorf("invalid initiator address format: %v", initiatorAddress)

-		return nil, errors.New("invalid target address format")
+		return nil, fmt.Errorf("invalid target address format: %v", targetAddress)

-			return nil, errors.New("invalid replica address format")
+			return nil, fmt.Errorf("invalid replica address format: %v", addr)

882-965: Add godoc comments for the new functions.

While the implementation is correct, adding detailed godoc comments would improve code documentation and maintainability.

Consider adding documentation for each function, for example:

// EngineInstanceSuspend suspends an engine instance.
// It returns an error if the engine is nil or if the operation is not supported for the data engine type.
// For DataEngineTypeV1, this operation is not supported.
// For DataEngineTypeV2, it delegates to the instance service client's InstanceSuspend method.
func (c *InstanceManagerClient) EngineInstanceSuspend(req *EngineInstanceSuspendRequest) error {

Apply similar documentation patterns to the other new functions:

  • EngineInstanceResume
  • EngineInstanceSwitchOverTarget
  • EngineInstanceDeleteTarget
controller/instance_handler.go (2)

58-165: Enhance error handling and logging in syncStatusIPsAndPorts.

While the implementation is solid, there are opportunities for improvement:

  1. Consider consolidating similar error handling patterns:
-if err != nil {
-    logrus.WithError(err).Errorf("Failed to get instance manager pod from %v", im.Name)
-    return
-}
-if imPod == nil {
-    logrus.Warnf("Instance manager pod from %v not exist in datastore", im.Name)
-    return
-}
+if err != nil || imPod == nil {
+    if err != nil {
+        logrus.WithError(err).Errorf("Failed to get instance manager pod from %v", im.Name)
+    } else {
+        logrus.Warnf("Instance manager pod from %v not exist in datastore", im.Name)
+    }
+    return
+}
  2. Add structured logging fields for better debugging:
-logrus.Infof("Instance %v starts running, Storage IP %v", instanceName, status.StorageIP)
+logrus.WithFields(logrus.Fields{
+    "instance": instanceName,
+    "storageIP": status.StorageIP,
+}).Info("Instance starts running")

867-1001: LGTM! Well-structured helper functions with one suggestion.

The helper functions are well-organized and follow good practices. However, the error handling in isV2DataEngineBeingUpgraded could be improved:

Consider propagating the error instead of swallowing it:

-func (h *InstanceHandler) isV2DataEngineBeingUpgraded(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) bool {
+func (h *InstanceHandler) isV2DataEngineBeingUpgraded(spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) (bool, error) {
     if !types.IsDataEngineV2(spec.DataEngine) {
-        return false
+        return false, nil
     }

     upgradeRequested, err := h.ds.IsNodeDataEngineUpgradeRequested(spec.NodeID)
     if err != nil {
-        logrus.WithError(err).Errorf("Failed to get node %v", spec.NodeID)
-        return false
+        return false, fmt.Errorf("failed to check node upgrade status: %v", err)
     }

     if !upgradeRequested {
-        return false
+        return false, nil
     }

     if spec.TargetNodeID == "" {
-        return false
+        return false, nil
     }

-    return spec.NodeID != spec.TargetNodeID && spec.TargetNodeID == status.CurrentTargetNodeID
+    return spec.NodeID != spec.TargetNodeID && spec.TargetNodeID == status.CurrentTargetNodeID, nil
 }

This change would require updating all callers to handle the error appropriately.
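
If the signature change is adopted, each caller needs a small adjustment so the error is surfaced rather than swallowed. A sketch of the calling pattern (the helper below is illustrative, not an actual function in this PR):

// syncIfNotUpgrading shows the intended calling convention: a lookup failure is
// propagated instead of being treated as "not upgrading".
func (h *InstanceHandler) syncIfNotUpgrading(instanceName string, spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus) error {
	upgrading, err := h.isV2DataEngineBeingUpgraded(spec, status)
	if err != nil {
		return errors.Wrapf(err, "failed to check whether instance %v is in a v2 data engine live upgrade", instanceName)
	}
	if upgrading {
		// Skip the regular state sync while the live upgrade is in progress.
		return nil
	}
	// ... continue with the regular sync path ...
	return nil
}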

controller/engine_controller.go (4)

437-467: Consider adding debug logs for instance manager resolution.

The method logic is sound, but could benefit from debug logs to aid troubleshooting, especially for the remote target instance path.

 func (ec *EngineController) findInstanceManagerAndIPs(obj interface{}) (im *longhorn.InstanceManager, initiatorIP string, targetIP string, err error) {
     e, ok := obj.(*longhorn.Engine)
     if !ok {
         return nil, "", "", fmt.Errorf("invalid object for engine: %v", obj)
     }
+    log := ec.logger.WithField("engine", e.Name)
 
     initiatorIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, false)
     if err != nil {
         return nil, "", "", err
     }
+    log.Debugf("Found initiator instance manager %v with IP %v", initiatorIM.Name, initiatorIM.Status.IP)
 
     initiatorIP = initiatorIM.Status.IP
     targetIP = initiatorIM.Status.IP
     im = initiatorIM
 
     if e.Spec.TargetNodeID != "" {
         targetIM, err := ec.ds.GetInstanceManagerByInstanceRO(obj, true)
         if err != nil {
             return nil, "", "", err
         }
+        log.Debugf("Found target instance manager %v with IP %v", targetIM.Name, targetIM.Status.IP)
 
         targetIP = targetIM.Status.IP
 
         if !e.Status.TargetInstanceReplacementCreated && e.Status.CurrentTargetNodeID == "" {
             im = targetIM
         }
     }
 
     return im, initiatorIP, targetIP, nil
 }

634-644: Enhance error messages for target deletion.

While the error handling is correct, the error messages could be more specific to help with troubleshooting.

 if e.Status.CurrentTargetNodeID != "" {
     err = c.EngineInstanceDeleteTarget(&engineapi.EngineInstanceDeleteTargetRequest{
         Engine: e,
     })
     if err != nil {
         if !types.ErrorIsNotFound(err) {
-            return errors.Wrapf(err, "failed to delete target for engine %v", e.Name)
+            return errors.Wrapf(err, "failed to delete target for engine %v on node %v", e.Name, e.Status.CurrentTargetNodeID)
         }
-        log.WithError(err).Warnf("Failed to delete target for engine %v", e.Name)
+        log.WithError(err).Warnf("Target not found while deleting for engine %v on node %v", e.Name, e.Status.CurrentTargetNodeID)
     }
 }

Line range hint 2422-2479: Consider adding intermediate status updates during upgrade.

While the upgrade logic is solid, adding intermediate status updates would improve observability.

 if types.IsDataEngineV2(e.Spec.DataEngine) {
+    log.Info("Starting v2 data engine upgrade validation")
     // Check if the initiator instance is running
     im, err := ec.ds.GetRunningInstanceManagerByNodeRO(e.Spec.NodeID, longhorn.DataEngineTypeV2)
     if err != nil {
         return err
     }
     if im.Status.CurrentState != longhorn.InstanceManagerStateRunning {
         return fmt.Errorf("instance manager %v for initiating instance %v is not running", im.Name, e.Name)
     }
+    log.Info("Initiator instance manager validation successful")

     initiatorIMClient, err := engineapi.NewInstanceManagerClient(im, false)
     if err != nil {
         return err
     }
     defer initiatorIMClient.Close()
+    log.Info("Starting target instance manager validation")

Line range hint 2548-2613: Add detailed documentation for ownership decision logic.

The ownership decision logic is complex and would benefit from detailed documentation explaining the different scenarios.

 func (ec *EngineController) isResponsibleFor(e *longhorn.Engine) (bool, error) {
+    // This method determines engine ownership based on several factors:
+    // 1. Node delinquency status
+    // 2. Data engine availability on nodes
+    // 3. Share manager ownership for RWX volumes
+    // 4. Current engine image availability
+    //
+    // The decision flow is:
+    // 1. Check if current owner node is delinquent
+    // 2. For RWX volumes, prefer share manager owner node
+    // 3. Verify data engine availability on preferred nodes
+    // 4. Select new owner based on availability and responsibility
controller/volume_controller.go (1)

1007-1012: Improve v2 data engine replica image handling

The code correctly handles the case where v2 volume replica images can be different from the volume image. However, the warning message could be more descriptive.

-                log.WithField("replica", r.Name).Warnf("Replica engine image %v is different from volume engine image %v, "+
-                    "but replica spec.Active has been set", r.Spec.Image, v.Spec.Image)
+                log.WithField("replica", r.Name).Warnf("V1 volume replica %v has mismatched engine image %v (volume image: %v) "+
+                    "and spec.Active is unexpectedly set", r.Name, r.Spec.Image, v.Spec.Image)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between eda6a7b and 92c8741.

⛔ Files ignored due to path filters (20)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/longhorn-instance-manager/pkg/api/instance.go is excluded by !vendor/**
  • vendor/github.com/longhorn/longhorn-instance-manager/pkg/client/instance.go is excluded by !vendor/**
  • vendor/github.com/longhorn/longhorn-instance-manager/pkg/util/util.go is excluded by !vendor/**
  • vendor/github.com/longhorn/types/pkg/generated/imrpc/instance.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_compare.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_format.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_forward.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_order.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertions.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/yaml/yaml_custom.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/yaml/yaml_default.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/yaml/yaml_fail.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go.tmpl is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require_forward.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/requirements.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (28)
  • controller/controller_manager.go (2 hunks)
  • controller/engine_controller.go (11 hunks)
  • controller/instance_handler.go (15 hunks)
  • controller/instance_handler_test.go (4 hunks)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
  • controller/monitor/upgrade_manager_monitor.go (1 hunks)
  • controller/node_controller.go (2 hunks)
  • controller/node_upgrade_controller.go (1 hunks)
  • controller/replica_controller.go (5 hunks)
  • controller/uninstall_controller.go (4 hunks)
  • controller/upgrade_manager_controller.go (1 hunks)
  • controller/utils.go (0 hunks)
  • controller/volume_controller.go (16 hunks)
  • controller/volume_controller_test.go (1 hunks)
  • engineapi/instance_manager.go (5 hunks)
  • engineapi/instance_manager_test.go (1 hunks)
  • k8s/crds.yaml (87 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (6 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (6 hunks)
  • scheduler/replica_scheduler.go (1 hunks)
  • types/types.go (4 hunks)
  • webhook/resources/dataengineupgrademanager/mutator.go (1 hunks)
  • webhook/resources/dataengineupgrademanager/validator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/mutator.go (1 hunks)
  • webhook/resources/nodedataengineupgrade/validator.go (1 hunks)
  • webhook/resources/volume/validator.go (5 hunks)
  • webhook/server/mutation.go (2 hunks)
  • webhook/server/validation.go (2 hunks)
💤 Files with no reviewable changes (1)
  • controller/utils.go
🚧 Files skipped from review as they are similar to previous changes (13)
  • controller/controller_manager.go
  • controller/monitor/node_upgrade_monitor.go
  • controller/monitor/upgrade_manager_monitor.go
  • controller/replica_controller.go
  • engineapi/instance_manager_test.go
  • scheduler/replica_scheduler.go
  • webhook/resources/dataengineupgrademanager/mutator.go
  • webhook/resources/dataengineupgrademanager/validator.go
  • webhook/resources/nodedataengineupgrade/mutator.go
  • webhook/resources/nodedataengineupgrade/validator.go
  • webhook/resources/volume/validator.go
  • webhook/server/mutation.go
  • webhook/server/validation.go
🧰 Additional context used
📓 Learnings (4)
controller/engine_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/engine_controller.go:524-527
Timestamp: 2024-11-25T12:39:58.926Z
Learning: In `controller/engine_controller.go`, `e.Status.Port` is sourced from the SPDK engine and does not require additional validation.
controller/instance_handler.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/instance_handler.go:919-939
Timestamp: 2024-11-25T23:56:53.252Z
Learning: In the Longhorn Manager Go code, the function `engineapi.NewInstanceManagerClient` does not accept a `context.Context` parameter. Therefore, we cannot pass `ctx` to it in functions like `isInstanceExist` in `controller/instance_handler.go`.
controller/node_upgrade_controller.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/node_upgrade_controller.go:135-139
Timestamp: 2024-11-26T00:03:56.311Z
Learning: In the Longhorn project, the constant `maxRetries` is defined in `base_controller.go` and should not be redefined in individual controller files.
controller/upgrade_manager_controller.go (2)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/upgrade_manager_controller.go:135-139
Timestamp: 2024-11-25T15:30:18.858Z
Learning: In the Longhorn Manager codebase, the variable `maxRetries` is defined in `base_controller.go`, and other controllers can use it without redefining it.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/node_upgrade_controller.go:135-139
Timestamp: 2024-11-26T00:03:56.311Z
Learning: In the Longhorn project, the constant `maxRetries` is defined in `base_controller.go` and should not be redefined in individual controller files.
🪛 golangci-lint (1.62.2)
controller/volume_controller.go

3212-3212: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor
controller/volume_controller.go

[notice] 3225-3225: controller/volume_controller.go#L3225
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3212-3212: controller/volume_controller.go#L3212
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

🔇 Additional comments (44)
k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (1)

61-67: LGTM: New instance state for upgrade support

The addition of InstanceStateSuspended state is well-aligned with the live upgrade feature requirements and follows the existing naming conventions.

controller/upgrade_manager_controller.go (5)

29-45: Add mutex protection for concurrent access to monitor

The dataEngineUpgradeManagerMonitor field is accessed from multiple goroutines without synchronization, which could lead to race conditions.


146-156: LGTM! Well-structured helper functions

The helper functions are well-defined, focused, and properly implement their responsibilities.


202-209: 🛠️ Refactor suggestion

Add cleanup verification before finalizer removal

The finalizer is removed without verifying that all resources have been properly cleaned up. Consider adding verification steps.

 if !upgradeManager.DeletionTimestamp.IsZero() {
+    // Verify all resources are cleaned up
+    if err := uc.verifyCleanup(upgradeManager); err != nil {
+        return err
+    }
     if uc.dataEngineUpgradeManagerMonitor != nil {
         uc.dataEngineUpgradeManagerMonitor.Close()
         uc.dataEngineUpgradeManagerMonitor = nil
     }
     return uc.ds.RemoveFinalizerForDataEngineUpgradeManager(upgradeManager)
 }

Likely invalid or redundant comment.


238-241: ⚠️ Potential issue

Improve error handling for monitor data type assertion

When the type assertion fails, the code continues execution with potentially invalid data. Consider returning an error to trigger a retry.

 status, ok := data.(*longhorn.DataEngineUpgradeManagerStatus)
 if !ok {
     log.Errorf("Failed to assert value from dataEngineUpgradeManager monitor: %v", data)
+    return fmt.Errorf("failed to assert monitor data to DataEngineUpgradeManagerStatus")
 } else {

Likely invalid or redundant comment.


136-136: Verify handleReconcileErrorLogging function availability

Verify that the handleReconcileErrorLogging function is defined in base_controller.go or another accessible location.

✅ Verification successful

Function is correctly defined and accessible

The handleReconcileErrorLogging function is properly defined in controller/utils.go and is accessible to the upgrade manager controller. The function handles different error scenarios appropriately with corresponding log levels (Trace, Warn, Error).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for handleReconcileErrorLogging function definition
ast-grep --pattern 'func handleReconcileErrorLogging($_, $_, $_) {
  $$$
}'

Length of output: 643

controller/node_upgrade_controller.go (3)

146-158: LGTM: Well-structured helper functions

The helper functions are well-implemented with clear responsibilities and proper error handling.


251-274: LGTM: Robust status update implementation

The status update logic includes proper error handling, type assertions, and deep copying of status fields. The implementation is thorough and safe.


223-229: 🛠️ Refactor suggestion

Enhance error handling in status updates

The deferred status update only handles conflict errors. Other types of errors should be logged and potentially trigger a requeue.

Apply this diff to improve error handling:

if !reflect.DeepEqual(existingNodeUpgrade.Status, nodeUpgrade.Status) {
    if _, err := uc.ds.UpdateNodeDataEngineUpgradeStatus(nodeUpgrade); err != nil {
        if apierrors.IsConflict(errors.Cause(err)) {
            log.WithError(err).Debugf("Requeue %v due to conflict", upgradeName)
            uc.enqueueNodeDataEngineUpgrade(upgradeName)
+       } else {
+           log.WithError(err).Errorf("Failed to update status: %v", err)
+           // Still requeue to retry the status update
+           uc.enqueueNodeDataEngineUpgrade(upgradeName)
        }
    }
}

Likely invalid or redundant comment.

controller/instance_handler_test.go (3)

Line range hint 39-51: LGTM: Method signature updates for remote node support.

The changes to GetInstance and CreateInstance signatures properly accommodate the new isInstanceOnRemoteNode parameter, maintaining consistency with the interface changes.


136-140: LGTM: Engine struct field additions for target configuration.

The additions of TargetIP, StorageIP, StorageTargetIP, and TargetPort fields in the engine struct properly support the v2 data engine live upgrade feature.


100-102: 🛠️ Refactor suggestion

Add test coverage for RequireRemoteTargetInstance.

The RequireRemoteTargetInstance mock implementation always returns false, which might not cover all test scenarios.

engineapi/instance_manager.go (2)

277-288: LGTM: NVMe subsystem and standby port fields properly implemented.

The implementation correctly handles NVMe subsystem initialization and field mapping, with appropriate type conversions and proper initialization of the paths map. The addition of standby target port fields maintains consistency with existing port field patterns.

Also applies to: 304-305


878-880: LGTM: Well-structured request types.

The request structs are properly defined with clear, focused purposes and appropriate fields for each operation type.

Also applies to: 899-901, 920-923, 946-948

controller/instance_handler.go (1)

38-47: LGTM! Interface changes align with v2 data engine requirements.

The new methods and signature updates provide a comprehensive interface for managing v2 data engine instances, including suspension, resumption, and target management capabilities.

Let's verify the interface implementation consistency:

✅ Verification successful

Let me try a different approach to verify the implementations.


Let's try one more verification approach to check the implementations.


Interface implementation is complete and consistent across the codebase

The verification shows that all new methods in the InstanceManagerHandler interface are properly implemented in both engine and replica controllers:

  • engine_controller.go implements all methods including SuspendInstance, ResumeInstance, SwitchOverTarget, DeleteTarget, and RequireRemoteTargetInstance
  • replica_controller.go implements all the same methods
  • Test mocks in instance_handler_test.go cover all the new methods
  • The implementations follow consistent patterns and signatures
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if all implementations of InstanceManagerHandler include the new methods

# Search for struct types that implement InstanceManagerHandler
ast-grep --pattern 'type $name struct {
  $$$
}

func ($receiver *$name) GetInstance($$$) ($$$) {
  $$$
}'

# Search for the new method implementations
echo "Checking new method implementations..."
for method in "SuspendInstance" "ResumeInstance" "SwitchOverTarget" "DeleteTarget" "RequireRemoteTargetInstance"; do
  echo "Checking $method implementations..."
  ast-grep --pattern "func ($receiver *$_) $method($_) error"
done

Length of output: 2015


Script:

#!/bin/bash
# Search for the struct that implements InstanceManagerHandler
rg -A 1 "type.*struct" controller/instance_handler.go

# Search for all method implementations
echo "Checking method implementations..."
rg "^func \(\w+ \*\w+\) (GetInstance|CreateInstance|DeleteInstance|LogInstance|SuspendInstance|ResumeInstance|SwitchOverTarget|DeleteTarget|RequireRemoteTargetInstance)" controller/instance_handler.go

Length of output: 412


Script:

#!/bin/bash
# Use git grep to search for method implementations with more context
echo "Searching for method implementations in instance_handler.go..."
git grep -A 2 "^func (.*) \(.*\)" controller/instance_handler.go

# Also check if there are any other files implementing these methods
echo -e "\nSearching for implementations in other files..."
rg "func.*\((.*?)\s+\*.*?\)\s+(GetInstance|CreateInstance|DeleteInstance|LogInstance|SuspendInstance|ResumeInstance|SwitchOverTarget|DeleteTarget|RequireRemoteTargetInstance)" --type go

Length of output: 12795

types/types.go (2)

43-44: LGTM! Constants follow naming conventions.

The new constants for data engine upgrade management are well-organized and follow the established naming patterns.

Also applies to: 163-164, 193-193


1271-1291: LGTM! Well-structured utility functions.

The new utility functions are well-implemented and follow the established patterns:

  • GenerateNodeDataEngineUpgradeName: Properly generates unique names
  • GetDataEngineUpgradeManagerLabels: Correctly sets up system managed component labels
  • GetNodeDataEngineUpgradeLabels: Handles optional parameters appropriately
controller/uninstall_controller.go (4)

49-50: LGTM! Constants follow established naming patterns.

The new CRD constants for data engine upgrades are well-defined and consistent with existing naming conventions.


208-219: LGTM! Event handler registration is properly implemented.

The event handler registration for new data engine upgrade resources follows the established pattern and includes proper error handling and cache syncs.


653-666: LGTM! Resource deletion order is logical.

The new data engine upgrade resources are appropriately added to the deletion sequence in the deleteCRs function.


1190-1256: LGTM! Deletion methods are well-implemented.

The deletion methods for DataEngineUpgradeManager and NodeDataEngineUpgrade:

  • Follow consistent error handling patterns
  • Properly manage finalizers
  • Include appropriate logging
  • Handle resource not found cases correctly
controller/volume_controller_test.go (1)

506-510: Verify field definitions in the Engine struct

Based on the past review comments, these fields may not be properly defined in the Engine struct. Please ensure that these fields are properly defined in k8s/pkg/apis/longhorn/v1beta2/engine.go.

#!/bin/bash
# Description: Verify the field definitions in the Engine struct

# Search for the Engine struct definition and verify fields
ast-grep --pattern $'type EngineStatus struct {
  $$$
  TargetIP string
  StorageTargetIP string
  TargetPort int32
  $$$
}'

# Also check for any usage of these fields in the codebase
rg "TargetIP|StorageTargetIP|TargetPort" --type go
controller/node_controller.go (1)

Line range hint 2177-2209: LGTM! Clean implementation of data engine upgrade scheduling control.

The changes properly implement scheduling control for data engine upgrades by:

  1. Adding a new condition for data engine upgrade requests
  2. Maintaining the existing cordoned node condition logic
  3. Following the established pattern for setting node conditions

The implementation is clean, well-structured, and integrates seamlessly with the existing code.

k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (4)

981-1092: LGTM: DataEngineUpgradeManager deepcopy implementation

The implementation correctly handles:

  • TypeMeta and ObjectMeta deepcopy
  • Proper nil checks
  • Deep copying of map values with proper type handling

1784-1891: LGTM: NodeDataEngineUpgrade deepcopy implementation

The implementation correctly handles:

  • TypeMeta and ObjectMeta deepcopy
  • Proper nil checks
  • Deep copying of map values with proper type handling

3031-3046: LGTM: UpgradeNodeStatus deepcopy implementation

The implementation correctly uses shallow copy since the struct contains only simple types.


3366-3381: LGTM: VolumeUpgradeStatus deepcopy implementation

The implementation correctly uses shallow copy since the struct contains only simple types.

controller/engine_controller.go (5)

Line range hint 469-538: LGTM! Well-structured changes with good logging.

The modifications to CreateInstance properly integrate the new findInstanceManagerAndIPs method and add helpful logging for debugging instance creation. The address handling is correct and safe.


649-676: LGTM! Well-implemented suspend functionality.

The SuspendInstance method has proper validation and error handling for v2 data engine compatibility.


678-705: LGTM! Well-implemented resume functionality.

The ResumeInstance method mirrors the suspend functionality with appropriate validation.


789-826: LGTM! Well-structured target deletion.

The DeleteTarget method includes proper validation and error handling for v2 data engine.


707-763: Verify port selection logic in switchover.

The port selection logic needs careful verification to ensure it handles all edge cases correctly.

✅ Verification successful

Let me analyze the port selection logic in the switchover code and the search results. The code shows:

  1. Port selection logic in SwitchOverTarget:
port := targetInstance.Status.TargetPortStart
if targetInstance.Status.StandbyTargetPortStart != 0 {
    port = targetInstance.Status.StandbyTargetPortStart
}

Let's verify how these ports are assigned and managed to ensure there are no conflicts:


Let me check one more aspect of the port management to understand how these ports are initially allocated:


Based on the analysis of the code and search results, I can now provide a final response about the port selection logic:

Port selection logic in target switchover is safe and well-managed

The port selection implementation is correct and handles edge cases appropriately:

  1. The port selection logic follows a clear precedence:

    • Uses StandbyTargetPortStart if available (non-zero)
    • Falls back to TargetPortStart otherwise
  2. Port management is properly handled:

    • Ports are tracked separately for target and standby instances
    • Port conflicts are prevented through instance manager's port allocation
    • Port status is consistently updated and logged during state changes
  3. The implementation includes safety checks:

    • Validates port availability before switchover
    • Maintains separate port ranges for different instance types
    • Logs port details for both initiator and target instances
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential port conflicts in target switchover scenarios

# Search for other instances of port assignment in the codebase
rg -A 3 "Status\.TargetPortStart|Status\.StandbyTargetPortStart" --type go

# Look for any existing port conflict handling
rg -A 3 "port.*conflict|port.*in.*use" --type go

Length of output: 100362


Script:

#!/bin/bash
# Search for port assignment and initialization logic
ast-grep --pattern 'Status.TargetPortStart = $_' 

# Look for port allocation/management code
rg -A 5 "TargetPortStart.*=|StandbyTargetPortStart.*=" --type go

# Check for port validation logic
rg -A 3 "validatePort|checkPort|verifyPort" --type go

Length of output: 4605


Script:

#!/bin/bash
# Search for port allocation and initialization in instance manager
rg -A 10 "func.*Instance.*Create|func.*Instance.*Start" --type go

# Look for port range definitions or configuration
rg -A 3 "PortRange|PortStart|PortEnd" --type go --no-filename | grep -v "vendor/"

Length of output: 52080

k8s/crds.yaml (3)

1791-1793: LGTM! The v2 data engine support changes are well-structured

The modifications to Engine, Replica and Node CRDs appropriately support v2 data engine live upgrades:

  1. Added targetNodeID fields with clear descriptions for scheduling during upgrades
  2. Added dataEngineUpgradeRequested field to Node CRD for upgrade coordination
  3. Added proper data engine type enums

Also applies to: 3298-3300, 2636-2638


2419-2513: 🛠️ Refactor suggestion

Add validation rules to NodeDataEngineUpgrade CRD

The NodeDataEngineUpgrade CRD would benefit from additional validation rules:

  1. spec.nodeID should be required since it's a key field
  2. spec.instanceManagerImage should have a minLength validation
  3. status.volumes should have a description explaining the volume upgrade states

Apply this diff to improve validation:

            properties:
              nodeID:
                description: NodeID specifies the ID of the node where the data engine upgrade will take place.
                type: string
+               minLength: 1
+             required:
+             - nodeID
              instanceManagerImage:
                description: InstanceManagerImage specifies the image to use for the instance manager upgrade.
                type: string
+               minLength: 1
              volumes:
                additionalProperties:
                  properties:
                    message:
                      type: string
                    state:
                      type: string
+                     enum:
+                     - pending
+                     - in-progress
+                     - completed
+                     - failed
                  type: object
+               description: Status of volume upgrades on this node, mapping volume name to upgrade state
                type: object

Likely invalid or redundant comment.


1313-1414: 🛠️ Refactor suggestion

Add validation rules to DataEngineUpgradeManager CRD

The DataEngineUpgradeManager CRD would benefit from additional validation rules:

  1. spec.dataEngine should be required since it's a key field
  2. status.state should have an enum of possible states

Apply this diff to improve validation:

            properties:
              dataEngine:
                description: DataEngine specifies the data engine type to upgrade to.
                enum:
                - v2
                type: string
+             required:
+             - dataEngine
              nodes:
                description: |-
                  Nodes specifies the list of nodes to perform the data engine upgrade on.
                  If empty, the upgrade will be performed on all available nodes.
                items:
                  type: string
                type: array
            type: object
          status:
            description: DataEngineUpgradeManagerStatus defines the observed state of the upgrade manager resource
            properties:
              state:
                type: string
+               enum:
+               - pending
+               - in-progress 
+               - completed
+               - failed

Likely invalid or redundant comment.

controller/volume_controller.go (9)

1827-1836: LGTM: Improved replica startup logic for v2 data engine

The code correctly handles replica startup for both v1 and v2 data engines, with proper image validation.


1873-1877: LGTM: Added v1 data engine image validation

The code correctly validates replica image matches volume image only for v1 data engine.


1923-1930: LGTM: Enhanced engine startup logic for v2 data engine

The code properly handles engine startup conditions for v2 data engine by checking both image and target node ID.


3261-3285: LGTM: Comprehensive detached volume upgrade handling

The code properly handles detached volume upgrades by:

  1. Resetting engine specs
  2. Cleaning up inactive replicas
  3. Updating volume status

3287-3301: LGTM: Proper replica update during detached state

The code correctly handles replica image updates based on data engine type during detached state.


3305-3331: LGTM: Well-structured helper functions for v2 data engine upgrade

The helper functions provide clear eligibility checks and completion status for v2 data engine upgrades.


5090-5098: LGTM: Comprehensive volume upgrade check

The code properly handles upgrade checks for both v1 and v2 data engines by validating:

  1. Image updates for both engines
  2. Target node changes for v2 engine

3212-3214: ⚠️ Potential issue

Handle engine state error during v2 data engine live upgrade

The TODO comment indicates missing error handling for non-running engine state during v2 data engine live upgrade.

Apply this fix to handle error states:

} else {
-    // TODO: what if e.Status.CurrentState != longhorn.InstanceStateRunning
+    if e.Status.CurrentState == longhorn.InstanceStateError {
+        log.Errorf("Engine entered error state during v2 data engine live upgrade")
+        return fmt.Errorf("engine in error state during live upgrade")
+    }
+    if e.Status.CurrentState != longhorn.InstanceStateRunning {
+        log.Debugf("Engine is in %v state, waiting for running state", e.Status.CurrentState)
+    }
}

Likely invalid or redundant comment.

🧰 Tools
🪛 golangci-lint (1.62.2)

3212-3212: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor

[notice] 3212-3212: controller/volume_controller.go#L3212
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)


3184-3254: Verify v2 data engine live upgrade implementation

The implementation looks correct but needs thorough testing of the upgrade path, especially around engine suspension and resumption.

Consider adding metrics/events to track v2 data engine live upgrade progress and success rate.

✅ Verification successful

Let me gather more information about the v2 data engine live upgrade implementation to verify error handling and state transitions.


Based on the gathered information, I can now provide a final response as the code shows a well-structured and safe implementation of the v2 data engine live upgrade process.

V2 data engine live upgrade implementation is properly safeguarded

The implementation includes comprehensive safety checks and proper state handling:

  • Clear completion criteria via isV2DataEngineLiveUpgradeCompleted
  • Safe state transitions between Running and Suspended states
  • Proper handling of replica address maps during upgrade
  • Validation of target instance replacement before proceeding
  • Clean rollback mechanism by clearing UpgradedReplicaAddressMap

The only minor concern is the TODO comment for handling non-running engine states, but this is a defensive check as the main flow already ensures the engine is running before suspension.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify v2 data engine live upgrade implementation by checking logs for upgrade state transitions

# Check for any errors during engine suspension/resumption
rg -A 5 "Suspending engine for v2 data engine live upgrade|Resuming engine for live upgrade" 

# Check for target node changes during upgrade
rg -A 5 "Updating target node .* for v2 data engine live upgrade"

# Check for image updates during upgrade  
rg -A 5 "Updating image .* for v2 data engine live upgrade"

Length of output: 2040


Script:

#!/bin/bash
# Check for error handling and state transitions in v2 data engine upgrade

# Look for v2 data engine upgrade completion check
ast-grep --pattern 'func isV2DataEngineLiveUpgradeCompleted($$$)'

# Check for state validation and error handling
rg -A 10 "CurrentState.*InstanceState(Running|Suspended)" controller/volume_controller.go

# Look for any additional validation or safety checks
rg -B 2 -A 5 "TargetInstanceReplacementCreated|UpgradedReplicaAddressMap" controller/volume_controller.go

Length of output: 11358

🧰 Tools
🪛 golangci-lint (1.62.2)

3212-3212: SA9003: empty branch

(staticcheck)

🪛 GitHub Check: CodeFactor

[notice] 3225-3225: controller/volume_controller.go#L3225
If block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary) (indent-error-flow)


[notice] 3212-3212: controller/volume_controller.go#L3212
If block ends with a return statement, so drop this else and outdent its block. (indent-error-flow)

Comment on lines +532 to +539
if h.isEngineOfV2DataEngine(obj, spec.DataEngine) {
	if isDataEngineNotBeingLiveUpgraded(spec, status) {
		// Not in data engine live upgrade
		break
	}
} else {
	break
}

🛠️ Refactor suggestion

Improve error handling in v2 data engine state management.

The state management logic for v2 data engine is comprehensive, but the error handling could be improved:

Consider adding retries for transient errors and better error categorization:

 if h.isEngineOfV2DataEngine(obj, spec.DataEngine) {
+    maxRetries := 3
+    var lastErr error
+    for i := 0; i < maxRetries; i++ {
         if isDataEngineNotBeingLiveUpgraded(spec, status) {
-            break
+            return nil
         }
+        if lastErr != nil {
+            logrus.WithError(lastErr).Warnf("Retry %d/%d: Failed to handle v2 data engine state", i+1, maxRetries)
+        }
         if spec.TargetNodeID != "" {
             if h.isEngineResumeRequired(spec, status) {
-                err = h.resumeInstance(instanceName, spec.DataEngine, runtimeObj)
+                lastErr = h.resumeInstance(instanceName, spec.DataEngine, runtimeObj)
             } else {
-                err = h.createInstance(instanceName, spec.DataEngine, runtimeObj)
+                lastErr = h.createInstance(instanceName, spec.DataEngine, runtimeObj)
             }
+            if lastErr == nil {
+                break
+            }
         } else {
-            err = h.createInstance(instanceName, spec.DataEngine, runtimeObj)
+            lastErr = h.createInstance(instanceName, spec.DataEngine, runtimeObj)
+            if lastErr == nil {
+                break
+            }
         }
+    }
+    if lastErr != nil {
+        return fmt.Errorf("failed to handle v2 data engine state after %d retries: %v", maxRetries, lastErr)
+    }
 }

Also applies to: 544-570


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
controller/monitor/node_upgrade_monitor.go (2)

25-27: Consider making the sync period configurable

The hardcoded sync period of 3 seconds might not be optimal for all environments. Consider making this configurable through environment variables or configuration settings to allow tuning based on cluster size and performance requirements.

 const (
-	NodeDataEngineUpgradeMonitorSyncPeriod = 3 * time.Second
+	// DefaultNodeDataEngineUpgradeMonitorSyncPeriod is the default period between monitor sync operations
+	DefaultNodeDataEngineUpgradeMonitorSyncPeriod = 3 * time.Second
 )
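
One way to make the period tunable without touching the CRDs is an environment variable with the current value as the fallback; a sketch (the variable name is an assumption):

const defaultNodeDataEngineUpgradeMonitorSyncPeriod = 3 * time.Second

// nodeDataEngineUpgradeMonitorSyncPeriod returns the sync period, optionally
// overridden by an environment variable (name is illustrative).
func nodeDataEngineUpgradeMonitorSyncPeriod() time.Duration {
	value := os.Getenv("NODE_DATA_ENGINE_UPGRADE_MONITOR_SYNC_PERIOD")
	if value == "" {
		return defaultNodeDataEngineUpgradeMonitorSyncPeriod
	}
	period, err := time.ParseDuration(value)
	if err != nil || period <= 0 {
		logrus.WithError(err).Warnf("Invalid sync period %q, falling back to %v", value, defaultNodeDataEngineUpgradeMonitorSyncPeriod)
		return defaultNodeDataEngineUpgradeMonitorSyncPeriod
	}
	return period
}

This keeps the default behavior identical for existing deployments while allowing larger clusters to back off the reconciliation rate. (Requires the os, time, and github.com/sirupsen/logrus imports.)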

141-165: Document valid state transitions

The state machine implementation would benefit from documentation describing the valid state transitions and their conditions. This would help maintainers understand the upgrade flow and prevent invalid state changes.

Add documentation like:

// handleNodeUpgrade processes the current state of the node upgrade.
// Valid state transitions are:
// - Undefined -> Initializing
// - Initializing -> SwitchingOver
// - SwitchingOver -> UpgradingInstanceManager
// - UpgradingInstanceManager -> SwitchingBack
// - SwitchingBack -> RebuildingReplica
// - RebuildingReplica -> Finalizing
// - Finalizing -> Completed
// Any state can transition to Error on failure
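
Beyond the doc comment, the allowed transitions could also be encoded in a small table so that invalid jumps fail fast and the documentation cannot drift from the code. A sketch using plain strings, since the exact longhorn.UpgradeState* constant names are not all shown here:

// validUpgradeTransitions encodes the state machine described in the doc comment above.
// In the real code the keys and values would be the longhorn.UpgradeState* constants.
var validUpgradeTransitions = map[string][]string{
	"undefined":                  {"initializing"},
	"initializing":               {"switching-over"},
	"switching-over":             {"upgrading-instance-manager"},
	"upgrading-instance-manager": {"switching-back"},
	"switching-back":             {"rebuilding-replica"},
	"rebuilding-replica":         {"finalizing"},
	"finalizing":                 {"completed"},
}

func validateUpgradeTransition(from, to string) error {
	if to == "error" {
		// Any state may transition to error on failure.
		return nil
	}
	for _, allowed := range validUpgradeTransitions[from] {
		if allowed == to {
			return nil
		}
	}
	return fmt.Errorf("invalid node data engine upgrade state transition from %q to %q", from, to)
}

A helper like this could be called at the top of handleNodeUpgrade before dispatching on the state, turning silent mis-transitions into explicit errors.
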
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 92c8741 and 387e4fe.

📒 Files selected for processing (1)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
🧰 Additional context used
📓 Learnings (1)
controller/monitor/node_upgrade_monitor.go (2)
Learnt from: james-munson
PR: longhorn/longhorn-manager#3211
File: app/post_upgrade.go:102-113
Timestamp: 2024-11-10T16:45:04.898Z
Learning: In Go, when a deferred function references a variable like `err`, ensure that the variable is declared in the outer scope and not within an inner scope (such as within `if err := ...`), to prevent compilation errors and unintended variable shadowing.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/monitor/node_upgrade_monitor.go:351-357
Timestamp: 2024-11-25T23:55:02.080Z
Learning: `GetVolumeRO` guarantees that `volume` is non-nil when `err == nil`, so explicit nil checks after error handling are not needed.
🪛 GitHub Check: CodeFactor
controller/monitor/node_upgrade_monitor.go

[notice] 172-278: controller/monitor/node_upgrade_monitor.go#L172-L278
Complex Method


[notice] 461-540: controller/monitor/node_upgrade_monitor.go#L461-L540
Complex Method

🔇 Additional comments (1)
controller/monitor/node_upgrade_monitor.go (1)

279-316: ⚠️ Potential issue

Improve resource management in snapshotVolumes

The method defers engineClientProxy.Close() inside a loop, which delays cleanup until the function returns. This could lead to resource exhaustion with many volumes.

 engineClientProxy, err := engineapi.GetCompatibleClient(engine, engineCliClient, m.ds, m.logger, m.proxyConnCounter)
 if err != nil {
     return err
 }
-defer engineClientProxy.Close()

 snapLabels := map[string]string{types.GetLonghornLabelKey(types.LonghornLabelSnapshotForDataEngineLiveUpgrade): m.nodeUpgradeName}
 _, err = engineClientProxy.SnapshotCreate(engine, m.upgradeManagerName+"-"+util.RandomID(), snapLabels, freezeFilesystem)
+// Close the proxy immediately after use
+engineClientProxy.Close()
 if err != nil {
     return err
 }

Likely invalid or redundant comment.
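
An alternative to closing the proxy inline is to scope each volume's work so that defer still runs once per iteration; a sketch built from the calls quoted above (the surrounding loop shape is assumed, since the full body of snapshotVolumes is not shown here):

for _, engine := range engines { // iteration shape is illustrative
	err := func() error {
		engineClientProxy, err := engineapi.GetCompatibleClient(engine, engineCliClient, m.ds, m.logger, m.proxyConnCounter)
		if err != nil {
			return err
		}
		// The deferred Close now runs at the end of this iteration, not when
		// snapshotVolumes returns, so proxies never accumulate.
		defer engineClientProxy.Close()

		snapLabels := map[string]string{types.GetLonghornLabelKey(types.LonghornLabelSnapshotForDataEngineLiveUpgrade): m.nodeUpgradeName}
		_, err = engineClientProxy.SnapshotCreate(engine, m.upgradeManagerName+"-"+util.RandomID(), snapLabels, freezeFilesystem)
		return err
	}()
	if err != nil {
		return err
	}
}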

Comment on lines 172 to 284
		}
	}()

	// Check if the node is existing and ready
	node, err := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
	if err != nil {
		err = errors.Wrapf(err, "failed to get node %v for nodeDataEngineUpgrade %v", nodeUpgrade.Status.OwnerID, nodeUpgrade.Name)
		return
	}
	condition := types.GetCondition(node.Status.Conditions, longhorn.NodeConditionTypeReady)
	if condition.Status != longhorn.ConditionStatusTrue {
		err = errors.Errorf("node %v is not ready", node.Name)
		return
	}

	defer func() {
		if err != nil {
			node, errGet := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
			if errGet != nil {
				log.WithError(errGet).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
				return
			}

			node.Spec.DataEngineUpgradeRequested = false
			if _, errUpdate := m.ds.UpdateNode(node); errUpdate != nil {
				log.WithError(errUpdate).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
				return
			}
		}
	}()

	// Mark the node as upgrade requested
	var volumes map[string]*longhorn.VolumeUpgradeStatus
	if node.Spec.DataEngineUpgradeRequested {
		condition := types.GetCondition(node.Status.Conditions, longhorn.NodeConditionTypeSchedulable)
		if condition.Status == longhorn.ConditionStatusTrue {
			log.Infof("DataEngineUpgradeRequested of node %v is set to true, but it is still schedulable", nodeUpgrade.Status.OwnerID)
			// Return here and check again in the next reconciliation
			return
		}

		var im *longhorn.InstanceManager
		im, err = m.ds.GetDefaultInstanceManagerByNodeRO(nodeUpgrade.Status.OwnerID, longhorn.DataEngineTypeV2)
		if err != nil {
			return
		}

		for name, engine := range im.Status.InstanceEngines {
			for _, path := range engine.Status.NvmeSubsystem.Paths {
				if path.State != "live" {
					m.nodeUpgradeStatus.Message = fmt.Sprintf("NVMe subsystem path for engine %v is in state %v", name, path.State)
				}
			}
		}

		replicas, errList := m.ds.ListReplicasByNodeRO(nodeUpgrade.Status.OwnerID)
		if errList != nil {
			err = errors.Wrapf(errList, "failed to list replicas on node %v", nodeUpgrade.Status.OwnerID)
			return
		}

		for _, r := range replicas {
			volume, errGet := m.ds.GetVolumeRO(r.Spec.VolumeName)
			if errGet != nil {
				err = errors.Wrapf(errGet, "failed to get volume %v for replica %v", r.Spec.VolumeName, r.Name)
				return
			}
			if volume.Spec.NodeID != "" {
				if volume.Spec.NumberOfReplicas == 1 {
					err = errors.Errorf("volume %v has only 1 replica", volume.Name)
					return
				}
			}
		}

		volumes, err = m.listVolumes(nodeUpgrade)
		if err != nil {
			err = errors.Wrapf(err, "failed to list volumes on node %v", nodeUpgrade.Status.OwnerID)
			return
		}

		err = m.snapshotVolumes(volumes)
		if err != nil {
			err = errors.Wrap(err, "failed to snapshot volumes")
			return
		}

		m.nodeUpgradeStatus.State = longhorn.UpgradeStateSwitchingOver
		m.nodeUpgradeStatus.Message = ""
		m.nodeUpgradeStatus.Volumes = volumes
	} else {
		node.Spec.DataEngineUpgradeRequested = true
		_, err = m.ds.UpdateNode(node)
		if err != nil {
			err = errors.Wrapf(err, "failed to update node %v to set DataEngineUpgradeRequested to true", nodeUpgrade.Status.OwnerID)
			return
		}
	}
}


🛠️ Refactor suggestion

Reduce complexity in handleUpgradeStateInitializing

This method is complex and handles multiple responsibilities. Consider breaking it down into smaller, focused methods for better maintainability and testing.

Extract the following into separate methods:

  1. Node validation logic (lines 183-192)
  2. Volume validation logic (lines 234-252)
  3. Snapshot creation logic (lines 254-264)

Example:

func (m *NodeDataEngineUpgradeMonitor) validateNode(nodeID string) error {
    node, err := m.ds.GetNode(nodeID)
    if err != nil {
        return errors.Wrapf(err, "failed to get node %v", nodeID)
    }
    condition := types.GetCondition(node.Status.Conditions, longhorn.NodeConditionTypeReady)
    if condition.Status != longhorn.ConditionStatusTrue {
        return errors.Errorf("node %v is not ready", node.Name)
    }
    return nil
}
🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 172-278: controller/monitor/node_upgrade_monitor.go#L172-L278
Complex Method

Comment on lines +790 to +857
func (m *NodeDataEngineUpgradeMonitor) updateVolumeForSwitchBack(volumeName, targetNode string) {
	log := m.logger.WithFields(logrus.Fields{"volume": volumeName, "targetNode": targetNode})

	var err error

	defer func() {
		if err != nil {
			m.nodeUpgradeStatus.Volumes[volumeName].Message = err.Error()
			if datastore.ErrorIsNotFound(err) {
				m.nodeUpgradeStatus.Volumes[volumeName].State = longhorn.UpgradeStateCompleted
			} else {
				m.nodeUpgradeStatus.Volumes[volumeName].State = longhorn.UpgradeStateError
				m.logger.WithError(err).Warnf("Failed to update volume %v for switch over", volumeName)
			}
		}
	}()

	volume, err := m.ds.GetVolume(volumeName)
	if err != nil {
		err = errors.Wrapf(err, "failed to get volume %v for switch over", volumeName)
		return
	}

	if volume.Spec.NodeID == "" {
		if volume.Status.OwnerID == "" {
			m.nodeUpgradeStatus.Volumes[volume.Name].State = longhorn.UpgradeStateCompleted
			m.nodeUpgradeStatus.Volumes[volume.Name].Message = "Volume is detached"
		} else {
			m.nodeUpgradeStatus.Volumes[volume.Name].State = longhorn.UpgradeStateError
			m.nodeUpgradeStatus.Volumes[volume.Name].Message = "Volume is being detached"
		}
		return
	}

	if volume.Spec.TargetNodeID != "" {
		if volume.Spec.TargetNodeID == volume.Spec.NodeID {
			m.nodeUpgradeStatus.Volumes[volume.Name].State = longhorn.UpgradeStateSwitchingBack
		} else {
			m.nodeUpgradeStatus.Volumes[volume.Name].State = longhorn.UpgradeStateError
			m.nodeUpgradeStatus.Volumes[volume.Name].Message = "Unexpected target node that is different from the attached node"
		}
	} else {
		m.nodeUpgradeStatus.Volumes[volume.Name].Message = "Volume is switched over"
		return
	}

	if volume.Spec.TargetNodeID != targetNode {
		volume.Spec.TargetNodeID = targetNode

		log.Infof("Updating volume %v to target node %v for switch back", volumeName, targetNode)
		_, err := m.ds.UpdateVolume(volume)
		if err != nil {
			return
		}

		m.nodeUpgradeStatus.Volumes[volume.Name].State = longhorn.UpgradeStateSwitchingBack
	}
}

🛠️ Refactor suggestion

Inconsistent error handling in updateVolumeForSwitchBack

The error handling is inconsistent: some errors are wrapped while others are returned directly, the UpdateVolume error is assigned with := (shadowing the outer err, so the deferred handler never sees the failure), and the log message says "switch over" although this path handles "switch back".

-	_, err := m.ds.UpdateVolume(volume)
+	_, err = m.ds.UpdateVolume(volume)
 	if err != nil {
+		err = errors.Wrapf(err, "failed to update volume %v for switch back", volumeName)
 		return
 	}

 	m.nodeUpgradeStatus.Volumes[volume.Name].State = longhorn.UpgradeStateSwitchingBack

-			m.logger.WithError(err).Warnf("Failed to update volume %v for switch over", volumeName)
+			m.logger.WithError(err).Warnf("Failed to update volume %v for switch back", volumeName)

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +76 to +81
	if err := wait.PollUntilContextCancel(m.ctx, m.syncPeriod, true, func(context.Context) (bool, error) {
		if err := m.run(struct{}{}); err != nil {
			m.logger.WithError(err).Error("Stopped monitoring upgrade monitor")
		}
		return false, nil
	}); err != nil {

🛠️ Refactor suggestion

Improve error handling in polling loop

The error from m.run() is logged but not propagated to the polling function. This means the monitor will continue running even after encountering errors that might require attention.

- if err := wait.PollUntilContextCancel(m.ctx, m.syncPeriod, true, func(context.Context) (bool, error) {
-   if err := m.run(struct{}{}); err != nil {
-     m.logger.WithError(err).Error("Stopped monitoring upgrade monitor")
-   }
-   return false, nil
- }); err != nil {
+ if err := wait.PollUntilContextCancel(m.ctx, m.syncPeriod, true, func(context.Context) (bool, error) {
+   if err := m.run(struct{}{}); err != nil {
+     m.logger.WithError(err).Error("Error in upgrade monitor")
+     // Returning the error stops the poll loop and surfaces the failure to the caller
+     return false, err
+   }
+   return false, nil
+ }); err != nil {

Committable suggestion skipped: line range outside the PR's diff.

Signed-off-by: Derek Su <[email protected]>
Signed-off-by: Derek Su <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (8)
k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (4)

95-97: Consider adding validation for TargetNodeID

The field is well-documented and properly structured. Consider adding a kubebuilder validation to reject empty values; whether the target node actually exists still has to be verified by the webhook or controller, since CRD schema validation cannot query cluster state.

Add node validation:

 // TargetNodeID is specifies the node where the volume's initiator instance should be scheduled during data engine live upgrade.
 // +optional
+// +kubebuilder:validation:MinLength=1
 TargetNodeID string `json:"targetNodeID"`

116-120: Consider grouping related target fields together

The target-related fields are currently split across the struct. Consider grouping them together for better maintainability.

Suggested organization:

type InstanceStatus struct {
  // ... existing fields ...
  IP string `json:"ip"`
  StorageIP string `json:"storageIP"`
  Port int `json:"port"`
+ TargetIP string `json:"targetIP"`
+ StorageTargetIP string `json:"storageTargetIP"`
+ TargetPort int `json:"targetPort"`
+ CurrentTargetNodeID string `json:"currentTargetNodeID"`
+ TargetInstanceReplacementCreated bool `json:"targetInstanceReplacementCreated"`
  // ... remaining fields ...
-  TargetIP string `json:"targetIP"`
-  StorageTargetIP string `json:"storageTargetIP"`
-  TargetPort int `json:"targetPort"`
-  CurrentTargetNodeID string `json:"currentTargetNodeID"`
-  TargetInstanceReplacementCreated bool `json:"targetInstanceReplacementCreated"`
}

Also applies to: 130-133


155-157: Add documentation for standby port fields

The new fields are well-structured but would benefit from documentation explaining their purpose in the upgrade process.

Add documentation:

+ // StandbyTargetPortEnd specifies the end of port range for standby target during upgrade
  // +optional
  StandbyTargetPortEnd int32 `json:"standbyTargetPortEnd"`
+ // StandbyTargetPortStart specifies the start of port range for standby target during upgrade
  // +optional
  StandbyTargetPortStart int32 `json:"standbyTargetPortStart"`

Also applies to: 164-166


168-183: Consider adding validations for NVMe fields

The NVMe types are well-structured but would benefit from additional validation rules.

Add validations:

type NvmeDevicePath struct {
  // +optional
+ // +kubebuilder:validation:Enum=tcp;rdma
  Trtype string `json:"trtype"`
  // +optional
+ // +kubebuilder:validation:Pattern=^[0-9A-Za-z.-]+$
  Traddr string `json:"traddr"`
  // +optional
+ // +kubebuilder:validation:Pattern=^[0-9]+$
  Trsvcid string `json:"trsvcid"`
  // +optional
+ // +kubebuilder:validation:Pattern=^[0-9A-Za-z.-]+$
  SrcAddr string `json:"srcAddr"`
  // +optional
+ // +kubebuilder:validation:Enum=live;connecting;error
  State string `json:"state"`
}
engineapi/instance_manager.go (1)

546-569: Enhance error messages with specific address information.

While the error handling is good, the error messages could be more helpful for debugging by including the specific address that caused the error.

Consider enhancing the error messages like this:

-		return nil, errors.New("invalid initiator address format")
+		return nil, fmt.Errorf("invalid initiator address format: %s", initiatorAddress)

-		return nil, errors.New("invalid target address format")
+		return nil, fmt.Errorf("invalid target address format: %s", targetAddress)

-			return nil, errors.New("invalid replica address format")
+			return nil, fmt.Errorf("invalid replica address format: %s", addr)
controller/instance_handler.go (1)

58-165: Consider refactoring for improved maintainability

The syncStatusIPsAndPorts function could benefit from the following improvements:

  1. Extract target instance handling logic into a separate function to reduce complexity
  2. Consolidate similar logging patterns into a helper function
  3. Consider using early returns to reduce nesting levels

Consider this refactor:

 func (h *InstanceHandler) syncStatusIPsAndPorts(im *longhorn.InstanceManager, obj interface{}, spec *longhorn.InstanceSpec, status *longhorn.InstanceStatus, instanceName string, instance longhorn.InstanceProcess) {
+    if err := h.syncBasicStatusIPsAndPorts(im, status, instanceName); err != nil {
+        return
+    }
+
+    if !h.instanceManagerHandler.IsEngine(obj) {
+        return
+    }
+
+    if types.IsDataEngineV2(spec.DataEngine) && spec.TargetNodeID != "" {
+        if err := h.syncTargetInstanceStatus(spec, status, instanceName); err != nil {
+            return
+        }
+    } else {
+        h.syncNonTargetInstanceStatus(im, status, instance, instanceName)
+    }
 }

+func (h *InstanceHandler) syncBasicStatusIPsAndPorts(im *longhorn.InstanceManager, status *longhorn.InstanceStatus, instanceName string) error {
     imPod, err := h.ds.GetPodRO(im.Namespace, im.Name)
     if err != nil {
-        logrus.WithError(err).Errorf("Failed to get instance manager pod from %v", im.Name)
-        return
+        return errors.Wrapf(err, "failed to get instance manager pod from %v", im.Name)
     }
     // ... rest of the basic sync logic
 }
controller/monitor/node_upgrade_monitor.go (2)

478-482: Unnecessary nil check after successful GetVolumeRO call

The GetVolumeRO function guarantees that volume is non-nil when err == nil, so the nil check on volume is unnecessary and can be removed to simplify the code.

Apply this diff to remove the unnecessary nil check:

if volume.Status.OwnerID != nodeUpgrade.Status.OwnerID {
    m.nodeUpgradeStatus.Volumes[name].State = longhorn.UpgradeStateError
    m.nodeUpgradeStatus.Volumes[name].Message = fmt.Sprintf("Volume %v is not owned by node %v", name, nodeUpgrade.Status.OwnerID)
    continue
}
- if volume == nil {
-     m.nodeUpgradeStatus.Volumes[name].State = longhorn.UpgradeStateError
-     m.nodeUpgradeStatus.Volumes[name].Message = fmt.Sprintf("Volume %v is not found", name)
-     continue
- }

172-283: Refactor handleUpgradeStateInitializing to reduce complexity

The handleUpgradeStateInitializing method is lengthy and handles multiple responsibilities, including node validation, volume validation, snapshot creation, and status updates. Breaking it down into smaller, focused methods would enhance readability, maintainability, and testability.

Consider extracting the following sections into separate methods:

  1. Node validation logic (lines 183-192)

    func (m *NodeDataEngineUpgradeMonitor) validateNode(nodeID string) error {
        node, err := m.ds.GetNode(nodeID)
        if err != nil {
            return errors.Wrapf(err, "failed to get node %v", nodeID)
        }
        condition := types.GetCondition(node.Status.Conditions, longhorn.NodeConditionTypeReady)
        if condition.Status != longhorn.ConditionStatusTrue {
            return errors.Errorf("node %v is not ready", node.Name)
        }
        return nil
    }
  2. Volume validation logic (lines 234-258)

    func (m *NodeDataEngineUpgradeMonitor) validateVolumes(nodeUpgrade *longhorn.NodeDataEngineUpgrade) error {
        // Volume validation logic
        return nil
    }
  3. Snapshot creation logic (lines 260-270)

    func (m *NodeDataEngineUpgradeMonitor) createVolumeSnapshots(volumes map[string]*longhorn.VolumeUpgradeStatus) error {
        // Snapshot creation logic
        return nil
    }

Refactoring these sections will make the code more modular and easier to maintain.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 387e4fe and 63665fc.

⛔ Files ignored due to path filters (20)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/longhorn-instance-manager/pkg/api/instance.go is excluded by !vendor/**
  • vendor/github.com/longhorn/longhorn-instance-manager/pkg/client/instance.go is excluded by !vendor/**
  • vendor/github.com/longhorn/longhorn-instance-manager/pkg/util/util.go is excluded by !vendor/**
  • vendor/github.com/longhorn/types/pkg/generated/imrpc/instance.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_compare.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_format.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_forward.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertion_order.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/assertions.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/yaml/yaml_custom.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/yaml/yaml_default.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/assert/yaml/yaml_fail.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go.tmpl is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require_forward.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/requirements.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (6)
  • controller/instance_handler.go (15 hunks)
  • controller/monitor/node_upgrade_monitor.go (1 hunks)
  • engineapi/instance_manager.go (6 hunks)
  • k8s/crds.yaml (87 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (6 hunks)
  • k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (6 hunks)
🧰 Additional context used
📓 Learnings (2)
controller/instance_handler.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/instance_handler.go:919-939
Timestamp: 2024-11-25T23:56:53.252Z
Learning: In the Longhorn Manager Go code, the function `engineapi.NewInstanceManagerClient` does not accept a `context.Context` parameter. Therefore, we cannot pass `ctx` to it in functions like `isInstanceExist` in `controller/instance_handler.go`.
controller/monitor/node_upgrade_monitor.go (2)
Learnt from: james-munson
PR: longhorn/longhorn-manager#3211
File: app/post_upgrade.go:102-113
Timestamp: 2024-11-10T16:45:04.898Z
Learning: In Go, when a deferred function references a variable like `err`, ensure that the variable is declared in the outer scope and not within an inner scope (such as within `if err := ...`), to prevent compilation errors and unintended variable shadowing.
Learnt from: derekbit
PR: longhorn/longhorn-manager#3282
File: controller/monitor/node_upgrade_monitor.go:351-357
Timestamp: 2024-11-25T23:55:02.080Z
Learning: `GetVolumeRO` guarantees that `volume` is non-nil when `err == nil`, so explicit nil checks after error handling are not needed.
🔇 Additional comments (19)
k8s/pkg/apis/longhorn/v1beta2/instancemanager.go (1)

61-67: LGTM: New instance state for upgrade support

The addition of InstanceStateSuspended state aligns with the live upgrade feature requirements.

Let's verify the state handling:

✅ Verification successful

The new suspended state is properly integrated and handled

The verification confirms that InstanceStateSuspended is well-integrated into the codebase:

  • Properly handled in instance state transitions in controller/instance_handler.go
  • Used in engine controller for live upgrade scenarios
  • Integrated into volume controller logic for managing engine state during upgrades
  • Includes proper validation to ensure it's only used with v2 data engine
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the new suspended state is properly handled
# Look for state transition handling code

rg -A 5 "InstanceStateSuspended" 
ast-grep --pattern 'switch $_ {
  case $$$
  case "suspended":
    $$$
}'

Length of output: 4108

engineapi/instance_manager.go (3)

6-6: LGTM: Import addition is appropriate.

The addition of the "net" package import is necessary for host:port parsing functionality used in the new code.


277-288: LGTM: NVMe subsystem handling is well-implemented.

The changes properly initialize and populate the NVMe subsystem information and add necessary fields for managing target ports. The implementation is thorough and maintains proper type safety.

Also applies to: 296-306


880-967: LGTM: Well-structured implementation of engine instance management functions.

The new functions for engine instance management (suspend, resume, switchover, delete target) are well-implemented with the following (a rough sequencing sketch follows after this list):

  • Proper parameter validation
  • Consistent error handling
  • Clear separation of concerns between data engine types
  • Proper type safety checks
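
For context, a rough sketch of how a caller might sequence these operations during a v2 live upgrade is shown below. The EngineInstanceOps interface and the liveUpgradeSwitchover helper are hypothetical illustrations, not the actual engineapi signatures:

type EngineInstanceOps interface {
	Suspend(engineName string) error
	SwitchOverTarget(engineName, newTargetAddress string) error
	Resume(engineName string) error
	DeleteTarget(engineName string) error
}

// liveUpgradeSwitchover pauses I/O, points the initiator at the new target,
// resumes I/O, and only then tears down the old target instance.
func liveUpgradeSwitchover(ops EngineInstanceOps, engineName, newTargetAddress string) error {
	if err := ops.Suspend(engineName); err != nil {
		return errors.Wrapf(err, "failed to suspend engine %v", engineName)
	}
	if err := ops.SwitchOverTarget(engineName, newTargetAddress); err != nil {
		// Best effort: resume the engine so it does not stay suspended after a failed switch-over.
		_ = ops.Resume(engineName)
		return errors.Wrapf(err, "failed to switch over engine %v", engineName)
	}
	if err := ops.Resume(engineName); err != nil {
		return errors.Wrapf(err, "failed to resume engine %v", engineName)
	}
	return ops.DeleteTarget(engineName)
}
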
controller/instance_handler.go (4)

38-47: LGTM: Interface changes support v2 data engine live upgrade

The new methods added to the InstanceManagerHandler interface provide a clear and comprehensive API for managing instance lifecycle during live upgrades.


224-227: ⚠️ Potential issue

Resolve error handling uncertainty

The TODO comment indicates uncertainty about error handling when checking node upgrade status. This should be resolved to ensure consistent behavior.


532-570: 🛠️ Refactor suggestion

Improve error handling in v2 data engine state management

The state management logic for v2 data engine is comprehensive, but the error handling could be improved.


925-945: 🛠️ Refactor suggestion

Add timeout to instance existence check

The isInstanceExist function makes network calls but doesn't have a timeout. This could lead to hanging in case of network issues.
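
As a minimal sketch, one way to bound the check without changing the client constructor (which, per the learning above, does not accept a context) is to wrap the synchronous call with a timeout. The checkInstanceExistWithTimeout helper and its checkFn parameter are hypothetical names:

// checkInstanceExistWithTimeout runs checkFn in a goroutine and gives up after
// timeout, so a hung instance-manager call cannot block the caller forever.
// Note that the underlying call keeps running in the background because it
// cannot be cancelled without a context.
func checkInstanceExistWithTimeout(timeout time.Duration, checkFn func() (bool, error)) (bool, error) {
	type result struct {
		exists bool
		err    error
	}
	resultCh := make(chan result, 1) // buffered so the goroutine's send never blocks

	go func() {
		exists, err := checkFn()
		resultCh <- result{exists: exists, err: err}
	}()

	select {
	case r := <-resultCh:
		return r.exists, r.err
	case <-time.After(timeout):
		return false, fmt.Errorf("instance existence check timed out after %v", timeout)
	}
}
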

k8s/pkg/apis/longhorn/v1beta2/zz_generated.deepcopy.go (4)

981-1092: LGTM: DataEngineUpgradeManager deepcopy implementation

The implementation correctly handles all fields including the nested map of UpgradeNodeStatus pointers in the Status field.


1784-1891: LGTM: NodeDataEngineUpgrade deepcopy implementation

The implementation correctly handles all fields including the nested map of VolumeUpgradeStatus pointers in the Status field.


3031-3046: Addressing past review comment: UpgradeNodeStatus deepcopy is correct

The current implementation with *out = *in is correct since UpgradeNodeStatus only contains simple types (State and Message are strings). No explicit field copying is needed.


3366-3381: Addressing past review comment: VolumeUpgradeStatus deepcopy is correct

The current implementation with *out = *in is correct since VolumeUpgradeStatus only contains simple types (State and Message are strings). No explicit field copying is needed.

k8s/crds.yaml (4)

2419-2513: Previous validation suggestions still apply

The validation improvements suggested in the previous review for NodeDataEngineUpgrade CRD are still applicable.


1313-1414: 🛠️ Refactor suggestion

Add validation rules to DataEngineUpgradeManager CRD

The DataEngineUpgradeManager CRD would benefit from additional validation rules:

  1. spec.dataEngine should be required since it's a key field
  2. status.state should have an enum of possible states

Apply this diff to improve validation:

            properties:
              dataEngine:
                description: DataEngine specifies the data engine type to upgrade to.
                enum:
                - v2
                type: string
              nodes:
                description: |-
                  Nodes specifies the list of nodes to perform the data engine upgrade on.
                  If empty, the upgrade will be performed on all available nodes.
                items:
                  type: string
                type: array
+           required:
+           - dataEngine
            type: object
          status:
            description: DataEngineUpgradeManagerStatus defines the observed state of the upgrade manager resource
            properties:
              state:
                type: string
+               enum:
+               - pending
+               - in-progress 
+               - completed
+               - failed

Likely invalid or redundant comment.


2636-2638: Verify upgrade trigger mechanism

The dataEngineUpgradeRequested field has been added to trigger instance manager upgrades for v2 volumes on specific nodes. Let's verify the upgrade trigger mechanism implementation.

✅ Verification successful

Let me gather more information about the upgrade trigger mechanism.


Based on the search results, I can now provide a final response about the upgrade trigger mechanism:

Verified upgrade trigger mechanism implementation

The dataEngineUpgradeRequested field is properly implemented as part of the node data engine upgrade workflow:

  • The field is used by the NodeDataEngineUpgradeMonitor to trigger and track instance manager upgrades
  • The upgrade process includes:
    1. Setting node.Spec.DataEngineUpgradeRequested = true to initiate the upgrade
    2. Handling the upgrade state via UpgradeStateUpgradingInstanceManager
    3. Managing instance manager recreation with the new image
    4. Setting node.Spec.DataEngineUpgradeRequested = false upon completion

The implementation includes proper error handling and state management throughout the upgrade process.
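
For illustration, the request/clear step roughly amounts to the sketch below, reusing the datastore calls seen in the monitor (the helper name is an assumption):

// setDataEngineUpgradeRequested flips the node flag that tells the node
// controller to recreate the instance manager with the new image. The monitor
// sets it to true when the upgrade starts and back to false once it completes.
func setDataEngineUpgradeRequested(ds *datastore.DataStore, nodeName string, requested bool) error {
	node, err := ds.GetNode(nodeName)
	if err != nil {
		return errors.Wrapf(err, "failed to get node %v", nodeName)
	}
	if node.Spec.DataEngineUpgradeRequested == requested {
		return nil // already in the desired state, nothing to update
	}
	node.Spec.DataEngineUpgradeRequested = requested
	if _, err := ds.UpdateNode(node); err != nil {
		return errors.Wrapf(err, "failed to update node %v", nodeName)
	}
	return nil
}
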

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the upgrade trigger mechanism implementation

# Search for dataEngineUpgradeRequested field usage
echo "Searching for dataEngineUpgradeRequested usage..."
rg -A 5 "dataEngineUpgradeRequested" --type go

# Search for upgrade trigger related code
echo "Searching for upgrade trigger implementation..."
ast-grep --pattern $'func $_($$) {
  $$$
  dataEngineUpgradeRequested
  $$$
}'

Length of output: 741


Script:

#!/bin/bash
# Search for code that handles node upgrades and instance manager upgrades
echo "Searching for node upgrade related code..."
rg -A 10 "InstanceManager.*[Uu]pgrade" --type go

# Search for code that modifies dataEngineUpgradeRequested
echo "Searching for code that sets dataEngineUpgradeRequested..."
rg -A 5 "DataEngineUpgradeRequested.*=.*" --type go

# Search for instance manager upgrade controller
echo "Searching for instance manager controller..."
fd "instance.*manager.*controller.*go" --type f --exec rg -A 5 "func.*[Uu]pgrade" {}

Length of output: 19949


1791-1793: Verify consistent usage of targetNodeID field

The targetNodeID field has been added to both Engine and Replica CRDs for coordinating instance placement during data engine live upgrades. Let's verify the consistent usage of this field across the codebase.

Also applies to: 3298-3300

✅ Verification successful

Let me gather more information about the usage of targetNodeID in the context of data engine upgrades.


Let me search for more specific usage patterns to verify the consistency.


Based on the extensive code analysis, I can now generate the final response since I have sufficient evidence about the targetNodeID field usage.

Consistent usage of targetNodeID field confirmed

The field is properly integrated across the codebase with consistent validation and handling:

  • Volume controller properly manages targetNodeID during data engine upgrades
  • Engine controller correctly handles target instance placement based on targetNodeID
  • Comprehensive validation in webhook ensures proper usage (see the sketch after this list):
    • Only allowed for data engine v2 volumes
    • Requires healthy volume state and multiple replicas
    • Prevents setting to same node as current node
    • Validates upgrade readiness conditions
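
A rough sketch of the checks listed above (field and constant names follow the Volume CRD, but the exact webhook structure and error messages are assumptions):

func validateTargetNodeID(volume *longhorn.Volume) error {
	if volume.Spec.TargetNodeID == "" {
		return nil // no switch-over requested, nothing to validate
	}
	if !types.IsDataEngineV2(volume.Spec.DataEngine) {
		return fmt.Errorf("targetNodeID is only supported for data engine v2 volumes")
	}
	if volume.Status.Robustness != longhorn.VolumeRobustnessHealthy {
		return fmt.Errorf("volume %v must be healthy before setting targetNodeID", volume.Name)
	}
	if volume.Spec.NumberOfReplicas < 2 {
		return fmt.Errorf("volume %v needs more than one replica for a live switch-over", volume.Name)
	}
	if volume.Spec.TargetNodeID == volume.Spec.NodeID {
		return fmt.Errorf("targetNodeID must differ from the node the volume is currently attached to")
	}
	return nil
}
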
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of targetNodeID field in Engine and Replica controllers

# Search for targetNodeID field usage in controllers
echo "Searching for targetNodeID usage in controllers..."
rg -A 5 "targetNodeID" --type go

Length of output: 5448


Script:

#!/bin/bash
# Search for data engine upgrade related code paths that use targetNodeID
echo "Searching for data engine upgrade related code..."
rg -B 3 -A 3 "DataEngineV2.*upgrade" --type go

# Search for volume controller handling of targetNodeID
echo -e "\nSearching for volume controller handling of targetNodeID..."
ast-grep --pattern 'func (vc *VolumeController) $$$targetNodeID$$$'

Length of output: 847


Script:

#!/bin/bash
# Search for volume spec targetNodeID validation and handling
echo "Searching for volume spec targetNodeID handling..."
rg -B 5 -A 5 "Spec\.TargetNodeID" --type go

# Search for engine and replica handling during upgrade
echo -e "\nSearching for engine and replica handling during upgrade..."
rg -B 3 -A 5 "DataEngineUpgrade" --type go

Length of output: 136728

controller/monitor/node_upgrade_monitor.go (3)

194-208: 🛠️ Refactor suggestion

Avoid variable shadowing in deferred function to prevent confusion and errors

In the deferred function starting at line 194, the variable err is redefined, shadowing the outer err variable. This can lead to confusion and unintended behavior. It's better to use different variable names inside the deferred function to prevent shadowing.

Apply this diff to prevent variable shadowing:

defer func() {
    if err != nil {
-       node, err := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
-       if err != nil {
-           log.WithError(err).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
+       node, getNodeErr := m.ds.GetNode(nodeUpgrade.Status.OwnerID)
+       if getNodeErr != nil {
+           log.WithError(getNodeErr).Warnf("Failed to get node %v", nodeUpgrade.Status.OwnerID)
            return
        }
        node.Spec.DataEngineUpgradeRequested = false
-       if _, err := m.ds.UpdateNode(node); err != nil {
-           log.WithError(err).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
+       if _, updateErr := m.ds.UpdateNode(node); updateErr != nil {
+           log.WithError(updateErr).Warnf("Failed to update node %v to set DataEngineUpgradeRequested to false", nodeUpgrade.Status.OwnerID)
            return
        }
    }
}()

Likely invalid or redundant comment.


817-853: 🛠️ Refactor suggestion

Avoid variable shadowing of err within the function to prevent confusion

Within updateVolumeForSwitchBack, the variable err is redefined when calling m.ds.GetVolume(volumeName) and subsequent operations, shadowing the outer err used in deferred error handling. This can lead to unintended behavior or confusion.

Apply this diff to use different variable names for errors:

volume, getVolumeErr := m.ds.GetVolume(volumeName)
if getVolumeErr != nil {
-   err = errors.Wrapf(err, "failed to get volume %v for switch back", volumeName)
+   err = errors.Wrapf(getVolumeErr, "failed to get volume %v for switch back", volumeName)
    return
}

// ...

if _, updateErr := m.ds.UpdateVolume(volume); updateErr != nil {
-   err = errors.Wrapf(err, "failed to update volume %v to target node %v for switch back", volumeName, targetNode)
+   err = errors.Wrapf(updateErr, "failed to update volume %v to target node %v for switch back", volumeName, targetNode)
    return
}

This prevents shadowing of the outer err variable and maintains clarity in error handling.

Likely invalid or redundant comment.


311-311: ⚠️ Potential issue

Avoid deferring resource closures inside loops to prevent resource leaks

Deferring engineClientProxy.Close() inside a loop causes all Close() calls to be postponed until the function returns, which can lead to resource exhaustion if the loop iterates many times. It's better to close the proxy as soon as it is no longer needed, or to extract the loop body into a helper function so a per-iteration defer also covers the early-return paths.

Apply this diff to close the proxy after each iteration:

engineClientProxy, err := engineapi.GetCompatibleClient(engine, engineCliClient, m.ds, m.logger, m.proxyConnCounter)
if err != nil {
    return err
}
- defer engineClientProxy.Close()

// ... existing code ...

_, err = engineClientProxy.SnapshotCreate(engine, m.upgradeManagerName+"-"+util.RandomID(), snapLabels, freezeFilesystem)
if err != nil {
    return err
}
+ // Close the proxy before the next iteration
+ engineClientProxy.Close()

Likely invalid or redundant comment.

Comment on lines +740 to +743
	logrus.Infof("Debug ========> InstanceList")
	instances, err := c.instanceServiceGrpcClient.InstanceList()
	if err != nil {
		logrus.Infof("Debug ========> InstanceList err=%v", err)

🛠️ Refactor suggestion

Remove temporary debug logging statements.

These debug logging statements appear to be temporary and should be removed from production code.

Remove the debug logging:

-	logrus.Infof("Debug ========> InstanceList")
 	instances, err := c.instanceServiceGrpcClient.InstanceList()
 	if err != nil {
-		logrus.Infof("Debug ========> InstanceList err=%v", err)
 		return nil, err
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-	logrus.Infof("Debug ========> InstanceList")
-	instances, err := c.instanceServiceGrpcClient.InstanceList()
-	if err != nil {
-		logrus.Infof("Debug ========> InstanceList err=%v", err)
+	instances, err := c.instanceServiceGrpcClient.InstanceList()
+	if err != nil {
+		return nil, err
+	}

Comment on lines +812 to +813
			m.logger.WithError(err).Warnf("Failed to update volume %v for switch over", volumeName)
		}

⚠️ Potential issue

Correct the log message to reflect the correct operation

In the updateVolumeForSwitchBack method, the log message references "switch over," but this method handles "switch back." This could cause confusion when debugging or reviewing logs.

Apply this diff to correct the log message:

if datastore.ErrorIsNotFound(err) {
    m.nodeUpgradeStatus.Volumes[volumeName].State = longhorn.UpgradeStateCompleted
} else {
    m.nodeUpgradeStatus.Volumes[volumeName].State = longhorn.UpgradeStateError
-   m.logger.WithError(err).Warnf("Failed to update volume %v for switch over", volumeName)
+   m.logger.WithError(err).Warnf("Failed to update volume %v for switch back", volumeName)
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-			m.logger.WithError(err).Warnf("Failed to update volume %v for switch over", volumeName)
-		}
+			m.logger.WithError(err).Warnf("Failed to update volume %v for switch back", volumeName)
+		}
