unexpected_exit_pid_ns-no-syscallbuf - fails with recent revisions (<defunct>) #3591

Closed
GitMensch opened this issue Aug 28, 2023 · 4 comments

Comments

@GitMensch
Contributor

ps now shows <defunct>; I'm quite sure that this worked before.
This is running on 5.15.0-103.114.4.el9uek.aarch64. Plain recording and replay work, but the no-syscallbuf test does not.

$ bin/rr record bin/unexpected_exit_pid_ns
rr: Saving execution to trace directory `/home/opc/.local/share/rr/unexpected_exit_pid_ns-1'.
off-main-thread id = 181765
sending SIGKILL to 181764
Sleeping...
EXIT-SUCCESS

$ bin/rr replay
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-10.0.2.el9
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/opc/.local/share/rr/unexpected_exit_pid_ns-1/mmap_clone_3_unexpected_exit_pid_ns...
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Really redefine built-in command "jump"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:50696
Reading symbols from /lib/ld-linux-aarch64.so.1...
BFD: warning: system-supplied DSO at 0x6ffd0000 has a section extending past end of file
0x0000ffffb5d6ccc0 in _start () from /lib/ld-linux-aarch64.so.1
(rr) c
Continuing.
off-main-thread id = 181765
sending SIGKILL to 181764
Sleeping...
EXIT-SUCCESS

Program received signal SIGKILL, Killed.
0x0000000070000004 in syscall_traced ()
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.0.2.el9.aarch64
(rr) c
Continuing.
[Inferior 1 (process 181762) exited normally]
(rr) q

$ ctest -R 'unexpected_exit_pid_ns' -VV
test 788
    Start 788: unexpected_exit_pid_ns

788: Test command: /usr/bin/bash "source_dir/src/test/basic_test.run" "unexpected_exit_pid_ns" "" "bin_dir" "120"
788: Test timeout computed to be: 10000000
788: Test 'unexpected_exit_pid_ns' PASSED
7/8 Test #788: unexpected_exit_pid_ns .......................   Passed    1.31 sec
test 789
    Start 789: unexpected_exit_pid_ns-no-syscallbuf

789: Test command: /usr/bin/bash "source_dir/src/test/basic_test.run" "unexpected_exit_pid_ns" "-n" "bin_dir" "120"
789: Test timeout computed to be: 10000000

Side question: is the timeout in milliseconds? If so, 10000000 would be a timeout of 10000 seconds (about 166 minutes). Either way, after 10 minutes the output still hasn't changed.
Checking with ps shows:

rr --suppress-environment-warnings --fatal-errors --check-cached-mmaps --resource-path=/home/opc/rr/build record -n ./unexpected_exit_pid_ns-k7dbvysQc
-> ./unexpected_exit_pid_ns-k7dbvysQc
---> [unexpected_exit] <defunct>

So bin/rr is waiting forever because ./unexpected_exit_pid_ns-k7dbvysQc waits forever for a process that is defunct (so its exit code needs to be queried until it is "removed from the list", no?)
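
For readers unfamiliar with the <defunct> state: a process that has exited stays in the process table as a zombie until its parent collects the exit status. The following is a minimal, Linux-only sketch of that behaviour in Python. It is not rr's code or the test's code; the helper names `proc_state` and `zombie_demo` are made up for illustration, and it assumes /proc is mounted.

```python
import os


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except FileNotFoundError:
        return None
    # The state field is the first token after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def zombie_demo(exit_code=7):
    """Fork a child that exits at once; report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit immediately, skipping cleanup handlers
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = proc_state(pid)        # zombie state "Z"; ps prints this as <defunct>
    _, status = os.waitpid(pid, 0)  # reaping collects the status and frees the entry
    after = proc_state(pid)         # the /proc entry is gone once the child is reaped
    return before, os.WEXITSTATUS(status), after
```

Calling `zombie_demo()` returns the state seen before reaping, the child's exit code, and the state after reaping. If the parent never performs the final wait, the entry stays in the zombie state, which matches the ps output above.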

@khuey
Collaborator

khuey commented Aug 28, 2023

I've seen this too.

@GitMensch
Contributor Author

That's possibly related to 115ae00.

Note that the result in this test is similar to what I first saw and (without much of a clue) worked on for old kernels in #2939, which ended with this final comment:

Rechecked: the "some still hanging" tests after the patch and the old kernel also hang with the new kernel.

I've already tested a bit more than I planned, the only thing that seems to work in this case with the old (and new) kernel is to use WNOHANG instead of WEXITED, check for the si_pid == 0 (= no return) and then query "something" afterwards - I just didn't find that something...

But of course this may be totally unrelated ... or not (as mentioned, I have "no clue"; well, only enough of one to see that this check is related).
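
To make the WNOHANG idea from the quoted comment concrete, here is a small Python sketch of non-blocking polling with waitid. It is an illustration only, not what rr or #2939 does, and it does not supply the missing "something" to query afterwards. Note that waitid requires at least one of WEXITED, WSTOPPED or WCONTINUED, so the sketch ORs WNOHANG with WEXITED rather than replacing it. The names `poll_child` and `wnohang_demo` are hypothetical. Python reports the C-level `si_pid == 0` case by returning `None`.

```python
import os
import time


def poll_child(pid):
    """Non-blocking check: None if the child has not exited yet, else its si_status."""
    # waitid() needs a state flag such as WEXITED; WNOHANG only makes it non-blocking.
    info = os.waitid(os.P_PID, pid, os.WEXITED | os.WNOHANG)
    if info is None:  # corresponds to si_pid == 0 in C: nothing to report yet
        return None
    # si_status is the exit code when si_code == os.CLD_EXITED (a normal exit).
    return info.si_status


def wnohang_demo(timeout=5.0):
    """Poll a child that stays alive until the parent closes a pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(w)
        os.read(r, 1)  # child: block until the parent closes the write end
        os._exit(3)
    os.close(r)
    while_running = poll_child(pid)  # child is still blocked, so this is None
    os.close(w)                      # release the child so it can exit
    deadline = time.monotonic() + timeout
    result = None
    while result is None and time.monotonic() < deadline:
        result = poll_child(pid)
        if result is None:
            time.sleep(0.01)         # brief pause between polls
    return while_running, result
```

`wnohang_demo()` returns the poll result taken while the child was still running, followed by the exit status picked up once polling succeeds. The deadline keeps the caller from hanging forever if the child never exits, which is the failure mode described in this issue.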

@rocallahan
Collaborator

This might be fixed by #3587

@GitMensch
Contributor Author

Yes, this issue is fixed with current master, likely by that change (one could assign the PR to this issue for better tracking, and possibly rename it from "DO NOT MERGE", too).
