Error with OSBrain Communication Across Separate Machines? #341

Open
sjanko2 opened this issue Aug 29, 2019 · 14 comments

@sjanko2

sjanko2 commented Aug 29, 2019

Hello,

Thank you for your development on this wonderful library. I have been using it for the last few years in work towards my PhD and very much appreciate the work you all do!

I have been running simulations with multiple agents connected to a local host server on one computer for a while now. This works successfully. Now, my coworker and I are attempting to put the scripts on separate computers and have them communicate through direct ethernet connection via the OSBrain library and are having difficulty. We have one computer hosting the nameserver ("Host_Computer", a Linux OS) and the other computer attempting to register with it ("Agent_Computer", a Windows OS). We get an error that indicates the following on the Agent_Computer:

OSError: [WinError 10049] The requested address is not valid in its context

and

Pyro4.errors.CommunicationError: cannot connect to ('0.0.0.0', 5020): [WinError 10049] The requested address is not valid in its context

I have included the scripts I made to illustrate the problem, and attached a screenshot of the full stacktrace on the Agent_Computer (see first screenshot). The Host_Computer runs perfectly and simply "waits" for the Nameserver to be populated with the Agent_Computer (see second screenshot).

I'd just like to note that this code works perfectly fine if run on the same computer (such as with the Multirun plug-in in Pycharm). The Agent_Computer connects to the Host_Computer and all is well. We have only run into this issue when trying to connect on multiple machines.

Any advice you can provide would be great, thanks so much!

Sam


Host_Computer.py


from osbrain import run_nameserver
from osbrain.proxy import locate_ns
import osbrain
import time

def main():
    # Windows
    # ns_sock = '127.0.0.1:1125'

    # # Linux
    ns_sock = '0.0.0.0:5020'

    osbrain.config['TRANSPORT'] = 'tcp'

    ns_proxy = run_nameserver(ns_sock)
    ns_addr = locate_ns(ns_sock)

    agents_in_NS = osbrain.nameserver.NameServer.agents(ns_proxy)

    while(len(agents_in_NS)<1):
        agents_in_NS = osbrain.nameserver.NameServer.agents(ns_proxy)
        print('Current agents in Nameserver are: %s' %agents_in_NS)
        time.sleep(1)

    print('All agents have joined!')

if __name__ == '__main__':

    main()

Agent_Computer.py


from osbrain import NSProxy
from osbrain import run_agent
import osbrain
import Pyro4

def main():
    ns_addr = '192.168.0.42:5020'

    osbrain.config['TRANSPORT'] = 'tcp'

    # Look for nameserver, if can't find then try again
    while True:
        try:
            ns_proxy = NSProxy(ns_addr)

            break

        except Pyro4.errors.CommunicationError:
            print('Pyro4.errors.CommunicationError, no nameserver found')
            print('Trying again...')

            pass

        except TimeoutError:
            print('TimeoutError, no nameserver found')
            print('Trying again...')

            pass
    
    # Join nameserver, if can't find then try again
    while True:
        try:
            # Activate and register with server
            print('Registering Agent with server...')
            agent_proxy = run_agent('Agent1', ns_addr)

            # Wait for proxy to set up
            checkstatus = agent_proxy.get_attr('_running')

            while checkstatus == False:
                checkstatus = agent_proxy.get_attr('_running')

            # Verify that agent has registered successfully
            agents_in_NS = osbrain.nameserver.NameServer.agents(ns_proxy)
            print('The following agents are in the server: %s' % agents_in_NS)

            break

        except KeyError:
            print('Key Error')
            print('Trying again...')

            pass

        except Pyro4.errors.CommunicationError:
            print('Pyro4.errors.CommunicationError, no nameserver found')
            print('Trying again...')

            pass

        except TimeoutError:
            print('TimeoutError, no nameserver found')
            print('Trying again...')

            pass

    print('I have joined the nameserver!')

if __name__ == '__main__':

    main()

Screenshot 1

Screenshot 2

@Peque
Member

Peque commented Sep 2, 2019

You can bind to 0.0.0.0, which means your socket will listen on all interfaces (i.e.: your localhost but also your LAN IP address), but you cannot connect to 0.0.0.0.

Try to connect to your Linux machine's LAN IP address, which you can find by running the command ifconfig from a terminal.

Also, it seems from the screenshots that you are running both scripts from the Windows machine, right? (no Linux machine)

Note that in one of the screenshots you are binding the nameserver to the localhost (127.0.0.1), which means you would not be able to connect to it from the LAN.
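The bind/connect distinction above can be shown with plain sockets, independent of osBrain. This is a minimal sketch using only the standard library: the server binds to the wildcard address, and the client names a concrete address. The loopback address is used here only because both ends run on one machine; across a LAN it would be the server's LAN IP. On Windows, a connect to 0.0.0.0 is rejected with WinError 10049, as in the traceback; some other systems treat it as the local machine, which may be why the same scripts work when run on a single computer.

```python
import socket
import threading


def serve_once(server: socket.socket) -> None:
    """Accept one connection and echo back whatever arrives."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))


# Binding to the wildcard address is fine: the socket listens on every interface.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 0))  # port 0 asks the OS for any free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

# A client has to name a concrete address. Loopback works here because both
# ends share one machine; a remote client would use the server's LAN IP.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"ping")
    reply = client.recv(1024)

server.close()
print(reply)
```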

@Peque
Member

Peque commented Sep 2, 2019

Also:

  • Use ns_proxy.agents() instead of osbrain.nameserver.NameServer.agents(ns_proxy)
  • Note that you do not need locate_ns in your code

😉
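With those two suggestions applied, the waiting loop in Host_Computer.py could be reduced to something like the sketch below. The helper name wait_for_agents and its parameters are made up for illustration; the only osBrain call it relies on is ns_proxy.agents(), as recommended above.

```python
import time


def wait_for_agents(ns_proxy, minimum=1, poll_interval=1.0):
    """Poll the name server until at least `minimum` agents are registered.

    `ns_proxy` is the object returned by run_nameserver(); only its
    agents() method is used, and locate_ns is not needed.
    """
    agents = ns_proxy.agents()
    while len(agents) < minimum:
        print('Current agents in Nameserver are: %s' % agents)
        time.sleep(poll_interval)
        agents = ns_proxy.agents()
    return agents
```

In the host script this would be called as wait_for_agents(ns_proxy) right after ns_proxy = run_nameserver(ns_sock).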

Peque self-assigned this Sep 2, 2019
Peque added this to the 0.7.0 milestone Sep 2, 2019
Peque added the question label Sep 2, 2019
@sjanko2
Author

sjanko2 commented Sep 8, 2019

Hi there,

Apologies, I had switched the roles of the machines out of convenience for the screenshots. Please see new attached screenshots with the Linux machine acting as the host (bound to 0.0.0.0) and the Windows machine acting as the agent (reaching out to the IP address of the Linux computer: 192.168.0.42). The port used on both machines is 5020. As you can see, the problem persists.

Do you have any ideas on why this might be occurring?

Also, thank you for your additional advice on use of ns_proxy.agents() and the lack of need for locate_ns. I will update the code to include those changes next time I am at this computer.

Thanks so much.

Sam

HostComputerScreenshotLinux

AgentComputerScreenshotWindows

@Peque
Member

Peque commented Sep 9, 2019

@sjanko2 Would you mind sharing the full Agent_Computer.py code? The screenshot seems to crop the code.

@sjanko2
Author

sjanko2 commented Sep 10, 2019

Sure, it is the same code as I put in my original post. See Below.


Agent_Computer.py


from osbrain import NSProxy
from osbrain import run_agent
import osbrain
import Pyro4

def main():
    ns_addr = '192.168.0.42:5020'

    osbrain.config['TRANSPORT'] = 'tcp'

    # Look for nameserver, if can't find then try again
    while True:
        try:
            ns_proxy = NSProxy(ns_addr)

            break

        except Pyro4.errors.CommunicationError:
            print('Pyro4.errors.CommunicationError, no nameserver found')
            print('Trying again...')

            pass

        except TimeoutError:
            print('TimeoutError, no nameserver found')
            print('Trying again...')

            pass
    
    # Join nameserver, if can't find then try again
    while True:
        try:
            # Activate and register with server
            print('Registering Agent with server...')
            agent_proxy = run_agent('Agent1', ns_addr)

            # Wait for proxy to set up
            checkstatus = agent_proxy.get_attr('_running')

            while checkstatus == False:
                checkstatus = agent_proxy.get_attr('_running')

            # Verify that agent has registered successfully
            agents_in_NS = osbrain.nameserver.NameServer.agents(ns_proxy)
            print('The following agents are in the server: %s' % agents_in_NS)

            break

        except KeyError:
            print('Key Error')
            print('Trying again...')

            pass

        except Pyro4.errors.CommunicationError:
            print('Pyro4.errors.CommunicationError, no nameserver found')
            print('Trying again...')

            pass

        except TimeoutError:
            print('TimeoutError, no nameserver found')
            print('Trying again...')

            pass

    print('I have joined the nameserver!')

if __name__ == '__main__':

    main()

@Peque
Member

Peque commented Sep 10, 2019

Weird... I am not able to reproduce it. Although I do not have a Windows machine at hand.

  • What happens if you connect from a Linux machine (with the host running on a Linux machine too)?
  • What happens if you bind to 192.168.0.42 instead of 0.0.0.0?
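If hard-coding the LAN address is inconvenient, the host script could look it up at start-up. The sketch below uses only the standard library; local_ip_towards is a hypothetical helper, and the peer address passed to it is a placeholder for any machine on the same LAN. It relies on the fact that connecting a UDP socket selects a route without sending a packet.

```python
import ipaddress
import socket


def local_ip_towards(peer_host, peer_port=5020):
    """Return the local IPv4 address the OS would use to reach `peer_host`.

    Connecting a UDP socket only selects a route; no packet is sent.
    Falls back to 127.0.0.1 when no route to the peer exists.
    """
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        probe.connect((peer_host, peer_port))
        return probe.getsockname()[0]
    except OSError:
        return '127.0.0.1'
    finally:
        probe.close()


# Placeholder peer: any address on the LAN the agents live on.
host_ip = local_ip_towards('192.168.0.1')
ns_sock = '%s:%d' % (host_ip, 5020)
print(ns_sock)
```

If the printed address starts with 127., the machine has no route to that LAN and the explicit address should be checked by hand instead.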

@sjanko2
Author

sjanko2 commented Sep 26, 2019

Hi Peque,

Apologies for the delay, it has been a busy two weeks.

We connected Linux (Host) to Linux (Agent) and it worked! See the attached screenshots and code, included for the sake of completeness of this thread.

Can you please help us understand why this may be necessary? For our purposes, we may need to connect a Windows to Linux in the future.

Thanks!

Sam


Agent_Computer.py



from osbrain import NSProxy
from osbrain import run_agent
import osbrain
import Pyro4

def main():
    ns_addr = '192.168.0.42:5020'

    osbrain.config['TRANSPORT'] = 'tcp'

    # Look for nameserver, if can't find then try again
    while True:
        try:
            ns_proxy = NSProxy(ns_addr)

            break

        except Pyro4.errors.CommunicationError:
            print('Pyro4.errors.CommunicationError, no nameserver found')
            print('Trying again...')

            pass

        except TimeoutError:
            print('TimeoutError, no nameserver found')
            print('Trying again...')

            pass
    
    # Join nameserver, if can't find then try again
    while True:
        try:
            # Activate and register with server
            print('Registering Agent with server...')
            agent_proxy = run_agent('Agent1', ns_addr)

            # Wait for proxy to set up
            checkstatus = agent_proxy.get_attr('_running')

            while checkstatus == False:
                checkstatus = agent_proxy.get_attr('_running')

            # Verify that agent has registered successfully
            agents_in_NS = osbrain.nameserver.NameServer.agents(ns_proxy)
            print('The following agents are in the server: %s' % agents_in_NS)

            break

        except KeyError:
            print('Key Error')
            print('Trying again...')

            pass

        except Pyro4.errors.CommunicationError:
            print('Pyro4.errors.CommunicationError, no nameserver found')
            print('Trying again...')

            pass

        except TimeoutError:
            print('TimeoutError, no nameserver found')
            print('Trying again...')

            pass

    print('I have joined the nameserver!')

if __name__ == '__main__':

    main()

Host_Computer.py


from osbrain import run_nameserver
from osbrain.proxy import locate_ns
import osbrain
import time

def main():
    # Windows
    # ns_sock = '127.0.0.1:1125'

    # # Linux
    ns_sock = '192.168.0.42:5020'

    osbrain.config['TRANSPORT'] = 'tcp'

    ns_proxy = run_nameserver(ns_sock)
    ns_addr = locate_ns(ns_sock)

    agents_in_NS = osbrain.nameserver.NameServer.agents(ns_proxy)

    while(len(agents_in_NS)<1):
        agents_in_NS = osbrain.nameserver.NameServer.agents(ns_proxy)
        print('Current agents in Nameserver are: %s' %agents_in_NS)
        time.sleep(1)

    print('All agents have joined!')

if __name__ == '__main__':

    main()

192-168-0-42 on Both with Linux (Agent) 2
192-168-0-42 on Both with Linux (Host) 2

@Peque
Member

Peque commented Sep 26, 2019

@sjanko2 What happens if you try with the Windows machine but you bind to 192.168.0.42 instead of 0.0.0.0 like you were doing before?

@sjanko2
Author

sjanko2 commented Sep 26, 2019

I believe we tried that a few weeks ago and had the same result. We suspect it may be related to the Windows image that our university's IT team installs on our computers.

Could it have something to do with firewalls?

@Peque
Member

Peque commented Sep 26, 2019

Unfortunately I do not have a Windows machine at hand, so I will not be able to help much with debugging. Just giving random ideas/advice hoping to find something... 😅

If you try the Linux+Windows combination again while binding to 192.168.0.42 (instead of 0.0.0.0), please report the results back. At least that way we can track all progress and debugging efforts here. 😊

@Peque
Member

Peque commented Sep 26, 2019

PS: since you are using osBrain over a LAN, firewalls can definitely influence the result. It is possible that your Windows firewall is somehow affecting the communication.

However, the fact that you got a "cannot connect to ('0.0.0.0', 5020)" error makes me think it may be related to binding to 0.0.0.0. You can bind to 0.0.0.0 but you cannot connect to 0.0.0.0.

Maybe binding to a specific address (192.168.0.42) lets the agent that wants to connect get the correct IP address.
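One way to tell a firewall problem apart from an address problem is to test raw TCP reachability first, without osBrain or Pyro4 in the picture. The helper below is a hypothetical sketch using only the standard library.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Running can_connect('192.168.0.42', 5020) on the Windows machine while the nameserver is up would be a first check. A False result points towards a firewall or the bind address; a True result while the traceback still mentions 0.0.0.0 suggests the problem is the address being handed out rather than the network.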

@ocaballeror
Contributor

I was able to reproduce the error, but I'm still not sure why it's happening.

The main thing I've found is that binding the name server to 0.0.0.0 causes all kinds of problems unless host and agent run on the same machine. If you want to connect Linux to Windows or Windows to Linux, you will need to use your specific IP (e.g. 192.168.0.42) for the name server process.

Apart from that, there seems to be a separate issue when connecting to a Windows host (from either a Linux or a Windows agent): the connection works just fine, but the nameserver is unable to shut down the agent, and both block waiting for a response instead of shutting down. Shutting down the agents before the name server works OK. I'm not sure whether this is the same issue or not.

The summary is:

  • Linux host + Linux agent: OK
  • Windows host + Windows agent: OK, as long as you bind directly to your IP.
  • Linux host + Windows agent: OK, as long as you bind directly to your IP.
  • Windows host + Linux agent: Bind to your IP. Manually shut down your agents before the name server.
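The shutdown order from the last bullet could be wrapped in a small helper. This is a sketch under one assumption: that both the agent proxies and the name server proxy expose a shutdown() method. The function name is made up for illustration.

```python
def shutdown_agents_first(agent_proxies, ns_proxy):
    """Shut down every agent explicitly, then the name server.

    Assumes each proxy has a shutdown() method; the order matters
    because the name server is shut down last.
    """
    for agent in agent_proxies:
        agent.shutdown()
    ns_proxy.shutdown()
```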

I will need to look deeper into the code to see what is going on with these 0.0.0.0 errors.

@ocaballeror
Contributor

It's been a while, but I think I managed to track down the issue.

I believe we are registering agents the wrong way. When an agent is registered in the nameserver, it uses its own local address (usually localhost and some random port), and the name server will store it that way and keep it as a reference for future connections. However, that is not enough for other machines in the network to be able to communicate with it.

This is also the problem when we bind to 0.0.0.0. The nameserver will register a new agent with that address, and any new agents wanting to communicate with it will try to connect to 0.0.0.0, and obviously fail to do so. This has nothing to do with the OS, and I can confirm it happens with both Linux and Windows host/agent combinations.
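The mechanism described in the last two paragraphs can be illustrated with a toy model. This is not osBrain or Pyro4 code: ToyNameServer and the agent names are made up, and the only point is that a registry which stores the reported address unchanged hands that same address to every peer, including peers on other machines.

```python
class ToyNameServer:
    """Toy registry: stores the address each agent reports, unchanged."""

    def __init__(self):
        self._registry = {}

    def register(self, name, advertised_addr):
        self._registry[name] = advertised_addr

    def dial_target(self, name):
        """Return the (host, port) a peer would try to connect to."""
        host, port = self._registry[name].rsplit(':', 1)
        return host, int(port)


ns = ToyNameServer()
ns.register('wildcard_agent', '0.0.0.0:5020')   # bound to all interfaces
ns.register('local_agent', '127.0.0.1:5021')    # bound to localhost
ns.register('lan_agent', '192.168.0.49:5021')   # bound to its LAN IP

for name in ('wildcard_agent', 'local_agent', 'lan_agent'):
    print(name, '->', ns.dial_target(name))
```

Only the last entry is an address that a second machine on the LAN can actually dial; the first reproduces the ('0.0.0.0', 5020) pair seen in the original traceback.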

We can overcome these problems by always specifying our IP address when creating the agents, and also by making sure that the nameserver itself uses its IP too.

A slight modification in Agent_Computer.py is enough to make this work:

... 
    # Join nameserver, if can't find then try again
    while True:
        try:
            # Activate and register with server
            print('Registering Agent with server...')
            addr = '192.168.0.49:5021'  # Or whatever IP your agent has
            agent_proxy = run_agent('Agent1', ns_addr, addr)
...
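To catch the mistake before it reaches the name server, the agent script could validate both addresses up front. The sketch below is an assumption-laden illustration: require_routable and join_nameserver are hypothetical helpers, only literal IPv4 'host:port' strings are handled, and run_agent is called with the same three positional arguments as in the snippet above.

```python
import ipaddress


def require_routable(addr):
    """Return `addr` ('host:port'), or raise ValueError if other machines cannot dial it.

    Only literal IPv4 addresses are handled; hostnames also raise ValueError.
    """
    host, _, port = addr.rpartition(':')
    ip = ipaddress.ip_address(host)
    int(port)  # raises ValueError if the port is not a number
    if ip.is_unspecified or ip.is_loopback:
        raise ValueError('%s is not reachable from other machines' % addr)
    return addr


def join_nameserver(name, ns_addr, agent_addr):
    """Start an agent that advertises `agent_addr` and registers at `ns_addr`."""
    from osbrain import run_agent  # imported lazily; needs osBrain installed

    return run_agent(name, require_routable(ns_addr), require_routable(agent_addr))
```

On the agent machine this would be called as join_nameserver('Agent1', '192.168.0.42:5020', '192.168.0.49:5021'), with the second address replaced by whatever IP the agent machine has.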

@Peque I think we should consider this a bug and find some way to fix it.

@Peque
Member

Peque commented Nov 14, 2019

@ocaballeror Thanks for having a look at it.

I am not sure we should consider this a bug. Having to specify the address explicitly when setting up a distributed architecture does not sound like a bad idea.

An option would be to, by default, make agents bind to X.X.X.X instead of localhost if the nameserver was bound to X.X.X.X. But I am not sure that is a good idea. Opening ports by default to an external network may not be smart.

Right now the problem seems easy to fix in the user's code. We should probably update the documentation, though, to make this clearer.
