broker: enable brokers to be added to running instances #5184

Open: wants to merge 8 commits into base: master

Commits on Aug 15, 2024

  1. broker: ensure CURVE certificate has a name

    Problem: internally generated CURVE certs are not named,
    so overlay_cert_name() can return NULL, but a name is required
    when authorizing a cert.
    
    This API inconsistency results in extra code and confusion
    when implementing a new boot method.
    
    Use the rank as the name for internally generated certs.
    garlick committed Aug 15, 2024
    7ee4e69
  2. broker: allow instance size > PMI bootstrap size

    Problem: there is no way to bootstrap a Flux instance using PMI
    with some ranks (initially) missing.
    
    Allow the 'size' broker attribute to be set on the command line.
    If set to a value greater than the PMI size, perform the PMI
    exchange as usual with the PMI size, but configure the overlay
    topology with the additional ranks.
    
    Since 'hostlist' is an immutable attribute that is expected to
    be set by the bootstrap implementation, set it to include
    placeholders ("extra[0-N]") for the ranks that haven't connected
    yet, so that the logs show something other than "(null)".
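
The size override above can be illustrated with a minimal Python sketch. This is not the broker's C code: the names `plan_bootstrap` and `placeholder_hosts`, the per-rank expansion of "extra[0-N]" into `extra0`, `extra1`, ..., and the rejection of a size smaller than the PMI size are all assumptions made for illustration.

```python
def placeholder_hosts(count):
    """Names for ranks that have not connected yet: extra0, extra1, ..."""
    return [f"extra{i}" for i in range(count)]


def plan_bootstrap(pmi_hosts, size_attr=None):
    """Return (instance_size, hosts_by_rank) for a PMI bootstrap.

    pmi_hosts: hostnames gathered by the PMI exchange, indexed by rank.
    size_attr: the 'size' attribute from the command line, or None.
    """
    pmi_size = len(pmi_hosts)
    size = pmi_size if size_attr is None else size_attr
    # Assumption: a size below the PMI size is rejected. The commit
    # message only describes the size > PMI size case.
    if size < pmi_size:
        raise ValueError("size must be >= PMI size")
    # The PMI exchange still covers only pmi_size ranks; the extra
    # ranks appear only in the topology and in the hostlist.
    hosts = list(pmi_hosts) + placeholder_hosts(size - pmi_size)
    return size, hosts


size, hosts = plan_bootstrap(["node0", "node1"], size_attr=5)
print(size, hosts)
```

In this sketch the two PMI ranks keep their real hostnames, and ranks 2 through 4 get placeholder names until a broker joins in their place.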
    garlick committed Aug 15, 2024
    2e9f9f1
  3. broker: refactor bootstrap block

    Problem: the code block that selects which boot method to use
    is not very clear.
    
    Simplify the code block so that the default path is clear and
    adding a boot method won't increase complexity.
    garlick committed Aug 15, 2024
    b6328d9
  4. broker: add flub bootstrap method

    Problem: there is no way to add brokers to an instance
    that has extra slots available.
    
    Add support for FLUB, the FLUx Bootstrap protocol, used when
    the broker is started with
      broker.boot-server=<uri>
    
    The bootstrap protocol consists of two RPCs:
    
    1) overlay.flub-getinfo, which requests the allocation of an
    available rank from rank 0 of the instance that is being extended,
    and also retrieves the instance size and some broker attributes.
    
    2) overlay.flub-kex, which exchanges public keys with the new
    rank's TBON parent and obtains the parent's TBON URI.
    
    Assumptions:
    - all ranks have the same topology configuration
    
    Limitations (for now):
    - hostnames will be logged as extra[0-N]
    - a broker rank cannot be re-allocated to a new broker
    - a broker cannot replace one that failed in a regular instance
    - dummy resources for the max size of the instance must be configured
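
The two-step exchange described above can be sketched as a small in-memory simulation in Python. It does not use the Flux API: `FakeInstance`, the payload keys (`rank`, `size`, `name`, `pubkey`, `uri`), the URI format, and the k-ary tree used to find the TBON parent are hypothetical and stand in for details the commit message does not give. Only the two topic strings come from the commit message.

```python
from dataclasses import dataclass, field


def kary_parent(rank, fanout):
    """Parent of a rank in a k-ary tree rooted at rank 0 (assumed topology)."""
    return None if rank == 0 else (rank - 1) // fanout


@dataclass
class FakeInstance:
    """Stand-in for the running instance; not a Flux API."""
    size: int
    available: list                                  # ranks free for allocation
    authorized: dict = field(default_factory=dict)   # cert name -> public key

    def rpc(self, topic, payload, rank):
        if topic == "overlay.flub-getinfo":
            if rank != 0:
                raise RuntimeError("flub-getinfo is served by rank 0 only")
            if not self.available:
                raise LookupError("no ranks available")
            return {"rank": self.available.pop(0), "size": self.size}
        if topic == "overlay.flub-kex":
            # The parent authorizes the child's key and returns its own.
            self.authorized[payload["name"]] = payload["pubkey"]
            return {"pubkey": f"PUBKEY-RANK-{rank}",
                    "uri": f"tcp://host{rank}:8050"}
        raise ValueError(f"unknown topic {topic}")


def flub_bootstrap(instance, my_pubkey, fanout):
    """Join an instance: step 1 gets a rank, step 2 exchanges keys."""
    info = instance.rpc("overlay.flub-getinfo", {}, rank=0)
    rank = info["rank"]
    # Every rank shares the same topology configuration, so the new
    # broker can compute its parent locally.
    parent = kary_parent(rank, fanout)
    kex = instance.rpc("overlay.flub-kex",
                       {"name": str(rank), "pubkey": my_pubkey},
                       rank=parent)
    return {"rank": rank, "size": info["size"], "parent": parent,
            "parent_pubkey": kex["pubkey"], "parent_uri": kex["uri"]}


inst = FakeInstance(size=8, available=[4, 5, 6, 7])
print(flub_bootstrap(inst, "PUBKEY-NEW", fanout=2))
```

The sketch relies on the stated assumption that all ranks have the same topology configuration: the joining broker works out its parent from its assigned rank without asking the instance.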
    garlick committed Aug 15, 2024
    35ac794
  5. broker: add flub RPC methods to overlay

    Problem: the flub bootstrap method requires broker services.
    
    Add the following services (instance owner only):
    
    overlay.flub-getinfo
      (rank 0 only) Allocate an unused rank and also return the
      instance size and misc. broker attributes to be set in the new
      broker.
    
    overlay.flub-kex
      (peer rank) Exchange public keys with the TBON parent and obtain
      its zeromq URI.
    
    Add overlay_flub_provision() which is called by boot_pmi.c when extra
    ranks are configured, making those ranks available for allocation.
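
As a rough illustration of the service side, the Python sketch below models the access checks and the rank pool. `FlubService`, its method signatures, the uid-based owner check, and the lowest-rank-first allocation order are assumptions; the real overlay_flub_provision() is a C function whose signature is not shown in this commit message.

```python
class FlubService:
    """Toy model of the rank 0 side of overlay.flub-getinfo."""

    def __init__(self, my_rank, size, owner_uid):
        self.my_rank = my_rank
        self.size = size
        self.owner_uid = owner_uid
        self.available = set()      # ranks that may be handed out

    def provision(self, lo_rank, hi_rank, available=True):
        """Mark ranks lo_rank..hi_rank (inclusive) available or not."""
        ranks = range(lo_rank, hi_rank + 1)
        if available:
            self.available.update(ranks)
        else:
            self.available.difference_update(ranks)

    def getinfo(self, sender_uid):
        """Allocate a rank for a new broker (instance owner, rank 0 only)."""
        if sender_uid != self.owner_uid:
            raise PermissionError("instance owner only")
        if self.my_rank != 0:
            raise RuntimeError("flub-getinfo is served by rank 0 only")
        if not self.available:
            raise LookupError("no ranks available")
        # Assumption: hand out the lowest available rank first.
        rank = min(self.available)
        self.available.remove(rank)
        return {"rank": rank, "size": self.size}


# PMI size 4, instance size 8: ranks 4-7 are the extra ranks.
svc = FlubService(my_rank=0, size=8, owner_uid=1000)
svc.provision(4, 7)
print(svc.getinfo(sender_uid=1000))
```

Here the call to `provision(4, 7)` plays the role of boot_pmi.c making the extra ranks available after a bootstrap with a PMI size of 4 and an instance size of 8.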
    garlick committed Aug 15, 2024
    c03df33
  6. testsuite: add coverage for instance size override

    Problem: there is no test coverage for broker bootstrap
    with a PMI size less than the actual size.
    
    Add some tests.
    garlick committed Aug 15, 2024
    9ccf30c
  7. testsuite: cover flub bootstrap

    Problem: there is no test coverage for adding brokers to
    a Flux instance.
    
    Add some tests.
    garlick committed Aug 15, 2024
    1a59ee7
  8. broker: provision dead brokers for flub replacement

    Problem: there is no way to replace a node that goes down
    in a Flux instance.
    
    Call overlay_flub_provision() when a rank goes offline
    so that the flub allocator can allocate its rank to a replacement.
    Unprovision ranks when they come back online.
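
The offline/online rule above amounts to a small state update on the pool of allocatable ranks. The Python sketch below shows one way to express it; `on_status_change` and the plain `set` used as the pool are hypothetical and are not the broker's data structures.

```python
def on_status_change(pool, rank, online):
    """Update the set of allocatable ranks when a peer changes state."""
    if online:
        # The original broker is back, so its rank is no longer on offer.
        pool.discard(rank)
    else:
        # The rank went down, so a replacement broker may claim it.
        pool.add(rank)


pool = {6, 7}                        # extra ranks not yet claimed
on_status_change(pool, 2, online=False)
print(sorted(pool))                  # rank 2 is now allocatable
on_status_change(pool, 2, online=True)
print(sorted(pool))                  # rank 2 withdrawn again
```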
    garlick committed Aug 15, 2024
    0ba34b1