Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket file removal tests randomly failing on Travis #287

Closed
ocaballeror opened this issue Apr 5, 2018 · 7 comments
Closed

Socket file removal tests randomly failing on Travis #287

ocaballeror opened this issue Apr 5, 2018 · 7 comments
Assignees
Milestone

Comments

@ocaballeror
Copy link
Contributor

ocaballeror commented Apr 5, 2018

https://travis-ci.org/opensistemas-hub/osbrain/jobs/362494498
https://travis-ci.org/ocaballeror/osbrain/jobs/362021572
https://travis-ci.org/ocaballeror/osbrain/jobs/362012690

Occurrences

Occurrences Test
1 test_agent_close_ipc_socket_agent_blocked_nameserver_shutdown
1 test_agent_close_ipc_socket_agent_crash_nameserver_shutdown
3 test_agent_close_ipc_socket_agent_kill
@ocaballeror ocaballeror added this to the 0.7.0 milestone Apr 5, 2018
@ocaballeror ocaballeror self-assigned this Apr 5, 2018
@Peque
Copy link
Member

Peque commented Apr 5, 2018

Yeah, I just found another one. Added a table to keep track of occurrence counts.

For now it seems the failure is always when the agent is killed or crashes (if the agent is blocked, then the name server will actually kill it, so it is the same).

@ocaballeror
Copy link
Contributor Author

I also checked my ~/.osbrain_ipc/ folder and it's full of socket files, so something is definitely wrong (and has been for a while).

@Peque
Copy link
Member

Peque commented Apr 5, 2018

@ocaballeror When were they last modified? We used to have that problem before d76a1ea.

Otherwise it might mean you can actually reproduce what we are seeing in Travis. 😄

@ocaballeror
Copy link
Contributor Author

I have a few ones from February 2-7, and the next ones were created on March 27th, so that could be a good clue of where to find a possible bug.

@ocaballeror
Copy link
Contributor Author

ocaballeror commented Apr 5, 2018

I just realized I can get osbrain to keep the socket files by killing the main process with ^C before it has time to be shut down cleanly, which makes sense, but it's something we should address in #249 before it's merged.

That is probably the reason why I had so many leftover files in my system, and that also makes me think that the one socket file I supposedly got to keep on purpose by running the tests over and over was actually the result of a manually killed process. That's bad news, because it means I have not been able to reproduce the error locally, and that this issue will be a pain to debug 😓

@ocaballeror
Copy link
Contributor Author

I left the tests running overnight, with 4 simultaneously instances of pytest running agent_close_ipc_socket_agent_kill over and over in both my home PC and my VPS (single core machine, 4 instances of pytest should be enough to overload the CPU), and after a total of more than 300,000 runs, not a single execution failed 😭 😭 . I'm kinda giving up on reproducing the error, at least by conventional means.

I also tried with different values of linger, but none of them seemed to "work", since the socket files were deleted when the name server is shut down, no matter the extra linger time that was set.

The only modification I can think of is to add an extra check after closing the socket, to see if the file is still there, and remove it if it is. Something like (pseudo code):

def remove_socket_file(socket):
  if os.path.exists("whatever the path of the socket file is"):
    os.unlink("whatever the path of the socket file is")

class Agent:
  ...
  def close_all(self):
  ...
    sock.close(linger=get_linger())
    self.after(linger, remove_socket_file, sock)

@Peque
Copy link
Member

Peque commented Apr 6, 2018

I was able to reproduce it: #293

Working on a fix...

@Peque Peque mentioned this issue Apr 6, 2018
@Peque Peque closed this as completed in #294 Apr 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants