Nameserver crashes unexpectedly #318

Open
ocaballeror opened this issue Nov 30, 2018 · 3 comments

@ocaballeror
Contributor

The nameserver crashed on shutdown and I could not restart it because it was left hanging, waiting for a rogue agent to shut down, which apparently is the expected behavior.

Surprisingly enough the error message shown was:

TimeoutError: Chances are [] were not shutdown after 10.0 s!

So it would appear that the agent was still alive after the call to async_kill_agents, but died in the few milliseconds between our check that it was alive and the TimeoutError being raised right after. I find that very strange, especially since the default timeout is 10 seconds, which should be plenty for any kind of agent to shut down.
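To make the suspected race concrete, here is a minimal sketch of how an empty list could end up in that message. It does not use osbrain at all: `FakeNameServer`, `shutdown_racy` and `shutdown_snapshot` are hypothetical names, and whether the real shutdown code has this shape is an assumption. The point is only that querying the agent list a second time while building the error message can report a different state than the one that triggered the timeout.

```python
import time


class FakeNameServer:
    """Hypothetical stand-in for a name server (not osbrain's API)."""

    def __init__(self, die_after_calls):
        self._calls = 0
        self._die_after = die_after_calls

    def agents(self):
        # The single fake agent "dies" once enough queries have been made.
        self._calls += 1
        return [] if self._calls > self._die_after else ['agent']


def shutdown_racy(ns, timeout):
    start = time.monotonic()
    while ns.agents():
        if time.monotonic() - start >= timeout:
            # Second query: the agent may have died since the loop check,
            # so the message can show an empty list.
            raise TimeoutError(
                'Chances are %s were not shutdown after %s s!'
                % (ns.agents(), timeout)
            )
        time.sleep(0.001)


def shutdown_snapshot(ns, timeout):
    start = time.monotonic()
    while True:
        alive = ns.agents()  # one query, reused for the check and the message
        if not alive:
            return
        if time.monotonic() - start >= timeout:
            raise TimeoutError(
                'Chances are %s were not shutdown after %s s!'
                % (alive, timeout)
            )
        time.sleep(0.001)
```

With the snapshot variant the message always names the agents that were seen alive when the timeout fired, which would at least make a report like this one easier to interpret.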

It probably has something to do with the agent being unresponsive and having broken the connection between it and the nameserver, but it's hard to know for sure until we can get a reproducible case.

@ocaballeror ocaballeror self-assigned this Nov 30, 2018
@ocaballeror
Contributor Author

We'll have to experiment by making the agents crash in different ways until we find a situation that we can reproduce.

@Peque Peque added this to the 0.7.0 milestone Nov 30, 2018
@Peque
Member

Peque commented Mar 11, 2019

Maybe unrelated, but I was able to reproduce a crash like that (only the list of agents was not empty) in my pypy branch with:

tox -e pypy3 -- -xsv -k close_ipc_socket_agent_blocked

@ocaballeror
Contributor Author

To be fair, there are quite a few things that don't seem to work with pypy, so I'm not sure if this counts as "reproducing" the error.

My guess from a few minutes of running this is that pypy must handle threads in a different way than what we are used to. The ContextTerminated errors that pop up when running this test certainly look like the context is being terminated before we expected.
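For reference, this is roughly what a ContextTerminated error looks like in plain pyzmq, outside of osbrain and independent of pypy. It is only a sketch: the function name and the inproc address are made up for illustration, and it assumes pyzmq is installed. A thread blocked in recv() gets the error as soon as another thread terminates the shared context.

```python
import threading

import zmq


def blocked_recv_after_term():
    """Terminate a context while another thread is blocked on it."""
    ctx = zmq.Context()
    ready = threading.Event()
    seen = []

    def worker():
        sock = ctx.socket(zmq.PULL)
        sock.bind('inproc://term-demo')  # hypothetical address
        ready.set()
        try:
            sock.recv()  # nobody ever sends, so this blocks
        except zmq.ContextTerminated as error:
            seen.append(type(error).__name__)
        finally:
            # term() only returns once every socket has been closed.
            sock.close(linger=0)

    thread = threading.Thread(target=worker)
    thread.start()
    ready.wait()
    ctx.term()  # makes the blocked recv() raise ContextTerminated
    thread.join()
    return seen
```

If the context is terminated earlier than the agent expects, any socket operation still in flight would fail this way, which would match the errors seen in the test run.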

What is happening on pypy reminds me of this other test I wrote when I first tried to reproduce the error.
The agent ends up in a very wrong state, and the output looks kind of similar:

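import time

from osbrain import run_agent, run_nameserver
# Assumption: agent_dies is the helper from osbrain.helper
from osbrain.helper import agent_dies


def test_agent_break():
    def break_internals(agent):
        # Terminate the agent's context from inside the agent itself
        agent._context.term()

    ns = run_nameserver()
    agent = run_agent('agent')
    agent.set_method(break_internals)
    # Schedule the breaking call to run right away in the agent
    agent.after(0, 'break_internals')
    time.sleep(.1)
    ns.shutdown(3)

    assert agent_dies('agent', ns)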

I still haven't found a way to reproduce the original Chances are [] were not shutdown error 😞. There could be many factors involved, but what exactly happened is still beyond me.
