Setup new Node class to improve consistency of node usage in dlg-engine #283

Merged
merged 14 commits into from
Oct 22, 2024

Conversation

Collaborator

@myxie myxie commented Sep 13, 2024

Issue

In #277 I added additional port options to the CLI so that it is possible to specify non-default RPC and Event ports. The intention of this was to allow for:

  1. Local testing of multiple Node Managers (e.g. partitioning)
  2. Deployment of DALiuGE in environments where the existing ports are used or blocked.

The motivating factor for this was that we had difficulties identifying what was causing issues getting Setonix deployment to work (LIU-377 (#271)).

One of the things that came out of discussions about the issue is the use of a unit test that would catch that sort of issue when partitioning across multiple nodes, using some of the additions in #277 when adding non-default port support. This is currently a WIP in #281.

I was unable to progress work on the unittest in #281 because I stumbled on the following limitations in DALiuGE:

# Set up event channels subscriptions
for nodesub in relationships:
    events_port = constants.NODE_DEFAULT_EVENTS_PORT

This method (node_manager.add_node_subscriptions) updates the drop relationships in a given session so that events can be sent between drops on different nodes. This is essential when notifying drops that, say, data has been written and can now be consumed from another node.

The issue is line 436:

events_port = constants.NODE_DEFAULT_EVENTS_PORT

Regardless of what ports the related host is actually running on, we will only ever be subscribing to events on the default port.

We do the same for RPC subscriptions in the session, which causes the DropProxies to be incorrectly managed as well:

for host, droprels in relationships.items():
    # Make sure we have DROPRel tuples
    droprels = [DROPRel(*x) for x in droprels]
    # Sanitize the host/rpc_port info if needed
    rpc_port = constants.NODE_DEFAULT_RPC_PORT
    if type(host) is tuple:
        host, _, rpc_port = host
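The effect of the events-port snippet can be reproduced in isolation. This is a simplified sketch, not DALiuGE code: `subscribe_targets`, the host names, and the port value 5555 are hypothetical stand-ins for the subscription loop and `constants.NODE_DEFAULT_EVENTS_PORT`.

```python
DEFAULT_EVENTS_PORT = 5555  # hypothetical stand-in for the real constant


def subscribe_targets(relationships):
    """Mimic the loop above: the port comes from a constant, never from the peer."""
    targets = []
    for host in relationships:
        events_port = DEFAULT_EVENTS_PORT
        targets.append((host, events_port))
    return targets


# Two Node Managers; the second was started with a non-default events port.
actual_events_ports = {"localhost-a": 5555, "localhost-b": 5556}
targets = dict(subscribe_targets(actual_events_ports))

# Hosts whose real events port differs from the one we would subscribe to.
missed = [h for h, p in actual_events_ports.items() if targets[h] != p]
print(missed)  # ['localhost-b']
```

Any manager started on a non-default events port ends up in `missed`, which is the situation a multi-manager local test runs into.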

Summary
We need to find a way to get non-default RPC/Event port information passed between nodes. Ideally, we don't want to rely on the default event port as the only port on which we communicate, especially if we want the flexibility to configure it in the future. Additionally, it would be ideal to not see the DEFAULTS_* variables anywhere except as a default value in a constructor, as using them elsewhere can lead to situations where everything silently follows the default behaviour.

The challenge here is that our current approach to managing information about hosts and ports is to use strings; if we don't specify the port, then we use the defaults, and the way we detect whether a non-default port was given is by splitting the string to get the port number:

if node.find(":") > 0:
    node, port = node.split(":")
with NodeManagerClient(host=node, port=port) as dm:

This approach is fine when we only want to keep track of up to 1 port, but if we need to send around up to 3 ports, that's a lot of splitting and conditionals.
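To make the concern concrete, here is a hedged sketch of what hand-parsing up to three ports at a single call site might look like. `parse_node_string` and its default values are hypothetical, not functions or constants from DALiuGE.

```python
def parse_node_string(node, port=8000, events_port=5555, rpc_port=6666):
    """Hand-rolled parsing of "host[:port[:events_port[:rpc_port]]]".

    The default values are illustrative assumptions only.
    """
    if node.find(":") > 0:
        fields = node.split(":")
        node = fields[0]
        if len(fields) > 1:
            port = int(fields[1])
        if len(fields) > 2:
            events_port = int(fields[2])
        if len(fields) > 3:
            rpc_port = int(fields[3])
    return node, port, events_port, rpc_port


print(parse_node_string("localhost:8001"))  # ('localhost', 8001, 5555, 6666)
```

Note also that the two-name unpacking `node, port = node.split(":")` in the snippet above raises `ValueError` as soon as the string contains more than one colon, so it cannot be reused unchanged for extra ports.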

I propose a solution that (hopefully) improves on the current node-host-port-management situation, and also allows us to send complete information about the setup of the hosts.

Solution

This PR introduces the manager_data.Node class, which encapsulates the 'Daliuge Node Protocol' (not its actual name). The node protocol is incredibly basic and just extends the existing pattern:

"host:port:event_port:rpc_port"

The idea is this class stores all information about the host, and performs all the necessary splits in the constructor only once, rather than having to do it every time we need it. Then, when the data needs to be stored or communicated 'over-the-wire', we convert it to a string and send it the way we normally have.
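A minimal sketch of the idea, not the class merged in this PR: the field order follows the "host:port:event_port:rpc_port" pattern above, while the name `NodeSketch`, the `serialize` method, and the default port values are assumptions made for illustration.

```python
# Hypothetical defaults, chosen only for this sketch.
DEFAULT_PORT = 8000
DEFAULT_EVENTS_PORT = 5555
DEFAULT_RPC_PORT = 6666


class NodeSketch:
    """Parse "host[:port[:events_port[:rpc_port]]]" once, in the constructor."""

    def __init__(self, spec: str):
        parts = spec.split(":")
        if not parts[0] or len(parts) > 4:
            raise ValueError(f"Invalid node specification: {spec!r}")
        defaults = [DEFAULT_PORT, DEFAULT_EVENTS_PORT, DEFAULT_RPC_PORT]
        # Ports that were given override the defaults, position by position.
        given = [int(p) for p in parts[1:]]
        ports = given + defaults[len(given):]
        self.host = parts[0]
        self.port, self.events_port, self.rpc_port = ports

    def serialize(self) -> str:
        """Return the full string form for storage or sending over the wire."""
        return f"{self.host}:{self.port}:{self.events_port}:{self.rpc_port}"


node = NodeSketch("localhost:8001")
print(node.serialize())  # localhost:8001:5555:6666
```

Because the serialized form always carries all four fields, a receiver that rebuilds the object from the string sees the same ports as the sender, with no per-call-site fallback to defaults.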

In my application of the Node class, I have tried to ensure it is the job of the Manager REST APIs to do the conversion to/from Node, and then the *Managers can use the Node class as expected. This reduces the need to constantly check whether we have a port; the Node object will always have a port, and it will often be the default port (but now we don't have to worry about it!).

I have confirmed that this works using the ArrayLoop graph and have ensured that all unittests are passing.

- rest.py acts as the interface between the JSON/string representation of nodes, and the DropManagers, which should only use the Node class from now on.

Note: This will not successfully deploy a graph, as the translator is not functional with the new information.
- Test with HelloWorld-Simple and ArrayLoop graphs.
- Unittests appear to work as well

coveralls commented Sep 18, 2024

Coverage Status

coverage: 79.625% (-0.008%) from 79.633%
when pulling 4efc36a on node-experiments
into 0fafa38 on master.

@myxie myxie marked this pull request as ready for review September 18, 2024 11:21
Contributor

sourcery-ai bot commented Sep 18, 2024

Reviewer's Guide by Sourcery

This pull request introduces a new Node class to improve consistency in node usage across the dlg-engine codebase. The changes primarily focus on refactoring how node information is handled, stored, and passed between different components of the system. The main goals are to standardize node representation, improve flexibility in port configuration, and reduce reliance on default values.

File-Level Changes

Change: Introduced a new Node class to encapsulate node information
Details:
  • Created a new Node class with host, port, events_port, and rpc_port attributes
  • Implemented serialization and deserialization methods for the Node class
  • Added equality and hash methods to support using Node objects in collections
Files:
  • daliuge-engine/dlg/manager/manager_data.py

Change: Refactored node handling in CompositeManager and related classes
Details:
  • Updated CompositeManager to use Node objects instead of strings for node representation
  • Modified methods like add_node, remove_node, and dmAt to work with Node objects
  • Updated property getters and setters to return Node objects or their string representations
Files:
  • daliuge-engine/dlg/manager/composite_manager.py

Change: Updated REST API and client code to work with the new Node class
Details:
  • Modified REST endpoints to accept and return Node objects as strings
  • Updated client code to create Node objects from string representations
  • Refactored methods in NodeManagerClient to work with Node objects
Files:
  • daliuge-engine/dlg/manager/rest.py
  • daliuge-common/dlg/clients.py

Change: Refactored session management to use Node objects
Details:
  • Updated add_node_subscriptions method to work with Node objects
  • Modified how node information is stored and retrieved in sessions
Files:
  • daliuge-engine/dlg/manager/session.py

Change: Updated test cases to work with the new Node class
Details:
  • Modified test setup code to create Node objects instead of using string representations
  • Updated assertions in tests to check for Node objects or their string representations
  • Refactored helper functions in tests to work with Node objects
Files:
  • daliuge-engine/test/manager/test_mm.py
  • daliuge-engine/test/manager/test_dim.py
  • daliuge-engine/test/manager/test_dm.py
  • daliuge-engine/test/manager/test_rest.py

@myxie myxie changed the title from "WIP: Setup new Node class to improve consistency of node usage in dlg-engine" to "Setup new Node class to improve consistency of node usage in dlg-engine" Sep 18, 2024
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @myxie - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider making the Node class immutable to prevent unexpected modifications. This could be achieved by using properties with only getters or by making it a dataclass with frozen=True.
  • Add more robust error handling and validation in the Node class constructor, such as checking if ports are within valid ranges.
  • Increase unit test coverage for the Node class to ensure correct behavior under various input scenarios and edge cases.
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟡 Testing: 3 issues found
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good
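
The first two overall comments (immutability and port validation) can be combined using the standard library. The following is a sketch under assumptions, not the implementation adopted in the PR: `FrozenNode` and its default port values are hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenNode:
    host: str
    port: int = 8000          # hypothetical default
    events_port: int = 5555   # hypothetical default
    rpc_port: int = 6666      # hypothetical default

    def __post_init__(self):
        # Validation runs once at construction; fields cannot change afterwards.
        for name in ("port", "events_port", "rpc_port"):
            value = getattr(self, name)
            if not 1 <= value <= 65535:
                raise ValueError(f"{name} must be between 1 and 65535, got {value}")


node = FrozenNode("localhost", events_port=5556)
try:
    node.rpc_port = 1234
except FrozenInstanceError:
    print("Node is immutable")
```

With `frozen=True` and the default `eq=True`, the dataclass also generates `__eq__` and `__hash__`, so instances can be used in sets and as dictionary keys.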

EVENTS_PORT = 2
RPC_PORT = 3

class Node:
Contributor

suggestion: Consider adding validation for port numbers in Node class

The Node class is a good addition for standardizing node information. Consider adding validation for port numbers to ensure they are within valid ranges. This could prevent potential issues with invalid port configurations.

class Node:
    def __init__(self, events_port=EVENTS_PORT, rpc_port=RPC_PORT):
        if not 1 <= events_port <= 65535 or not 1 <= rpc_port <= 65535:
            raise ValueError("Port numbers must be between 1 and 65535")
        self.events_port = events_port
        self.rpc_port = rpc_port

Collaborator Author

This is a good idea!

Contributor

Agreed!

Contributor

Thank you for your feedback, we will generate more comments like this in the future according to the following instructions:

- Include comments that suggest improvements or enhancements to the code, such as adding validation checks.
- Ensure the comment provides a clear rationale for the suggestion, explaining potential issues that could be avoided.
- Provide specific examples or code snippets to illustrate the suggestion, making it easier for the developer to understand and implement the change.

daliuge-engine/test/manager/test_mm.py (resolved)
daliuge-engine/test/manager/test_dim.py (resolved)
daliuge-engine/test/manager/test_rest.py (resolved)

        with NodeManagerClient(host=node.host, port=node.port) as dm:
            return dm.session_status(sessionId)
    except ValueError as e:
        raise Exception(f"{node_str} not in current list of nodes:", e)
Contributor

suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

Suggested change
-        raise Exception(f"{node_str} not in current list of nodes:", e)
+        raise Exception(f"{node_str} not in current list of nodes:", e) from e
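
For context, `raise ... from e` records the original exception as `__cause__` on the new one, so the traceback shows both. A small self-contained illustration, in which `lookup` and the node strings are hypothetical:

```python
def lookup(nodes, node_str):
    """Return the index of node_str, chaining the original error if it is absent."""
    try:
        return nodes.index(node_str)
    except ValueError as e:
        raise Exception(f"{node_str} not in current list of nodes") from e


try:
    lookup(["localhost:8000"], "localhost:8001")
except Exception as err:
    print(type(err.__cause__).__name__)  # ValueError
```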

        with NodeManagerClient(host=node.host, port=node.port) as dm:
            return dm.graph(sessionId)
    except ValueError as e:
        raise Exception(f"{node_str} not in current list of nodes:", e)
Contributor

suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

Suggested change
-        raise Exception(f"{node_str} not in current list of nodes:", e)
+        raise Exception(f"{node_str} not in current list of nodes:", e) from e

            return dm.graph_status(sessionId)

    except ValueError as e:
        raise Exception(f"{node_str} not in current list of nodes:", e)
Contributor

suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

Suggested change
-        raise Exception(f"{node_str} not in current list of nodes:", e)
+        raise Exception(f"{node_str} not in current list of nodes:", e) from e

Comment on lines +146 to 150
if isinstance(host, Node):
    endpoint = (host.host, port)
else:
    endpoint = (host, port)

Contributor

suggestion (code-quality): Replace if statement with if expression (assign-if-exp)

Suggested change
- if isinstance(host, Node):
-     endpoint = (host.host, port)
- else:
-     endpoint = (host, port)
+ endpoint = (host.host, port) if isinstance(host, Node) else (host, port)

if return_tuple:
    return "localhost", 5553 + n, 6666 + n
else:
    return Node(f"localhost:{8000}:{5553+n}:{6666+n}")
Contributor

suggestion (code-quality): Simplify unnecessary nesting, casting and constant values in f-strings (simplify-fstring-formatting)

Suggested change
-     return Node(f"localhost:{8000}:{5553+n}:{6666+n}")
+     return Node(f"localhost:8000:{5553 + n}:{6666 + n}")

Collaborator Author

myxie commented Sep 18, 2024

Hi @awicenec - I finally got the multiple-NodeManagers working locally, by using this approach to fix the default-ports issue. Hopefully there is enough information in the PR description; please let me know if you need me to clarify things.

Contributor

Remove the commented lines.

Contributor

@awicenec awicenec left a comment

Pretty massive amount of changes, but I think we are going in the right direction. For an even more robust solution we will need some registration DB, like Ray is using, which keeps information about all managers involved in the deployment.

@myxie myxie merged commit 4be953f into master Oct 22, 2024
19 checks passed
3 participants