Fix SIGCONT handling on threads blocked in syscalls #3874


@KJTsanaktsidis KJTsanaktsidis commented Nov 9, 2024

If a process receives a SIGSTOP, we emulate the group-stop by:

  • Leaving the thread which happened to receive the SIGSTOP signal ptrace-stopped
  • Refusing to schedule any other thread until the group-stop is over

The whole group-stop is therefore emulated by rr and not actually enforced by the kernel.

When a SIGCONT is received, we need to end the group-stop. However, we can't actually know that a ptrace-stopped thread received a signal until we try to resume it. To work around this, the scheduler checks the SigPnd and ShdPnd fields of /proc/tid/status to detect when a thread that's in a group-stop has a pending SIGCONT, and so needs to be PTRACE_CONT'd so we can actually wait and receive that SIGCONT.
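As a rough illustration of the pending-signal check described above, here is a minimal sketch that parses the SigPnd and ShdPnd hex masks out of the text of a /proc/tid/status file and tests the bit for a given signal. The free functions `parse_mask_field` and `is_signal_pending` are hypothetical helpers written for this example; they are not rr's implementation, and the sketch assumes the usual Linux convention that signal N corresponds to bit N-1 of the mask.

```cpp
#include <csignal>
#include <cstdint>
#include <exception>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: extract a hex mask field such as "SigPnd" or "ShdPnd"
// from the text of /proc/<tid>/status. Returns nullopt if the field is
// missing or its value is not valid hex.
std::optional<uint64_t> parse_mask_field(const std::string& status_text,
                                         const std::string& field) {
  std::istringstream lines(status_text);
  std::string line;
  const std::string prefix = field + ":";
  while (std::getline(lines, line)) {
    if (line.compare(0, prefix.size(), prefix) != 0) {
      continue;  // not the field we are looking for
    }
    try {
      // stoull skips the leading tab/whitespace before the hex digits.
      return std::stoull(line.substr(prefix.size()), nullptr, 16);
    } catch (const std::exception&) {
      return std::nullopt;
    }
  }
  return std::nullopt;
}

// Hypothetical helper: true if `sig` is pending either thread-directed
// (SigPnd) or process-directed (ShdPnd). Assumes signal N maps to bit N-1.
bool is_signal_pending(const std::string& status_text, int sig) {
  const uint64_t bit = uint64_t{1} << (sig - 1);
  const uint64_t sigpnd = parse_mask_field(status_text, "SigPnd").value_or(0);
  const uint64_t shdpnd = parse_mask_field(status_text, "ShdPnd").value_or(0);
  return ((sigpnd | shdpnd) & bit) != 0;
}
```

Because ShdPnd describes the process-wide pending set, a check like this returns the same answer for every thread of the process when the signal is process-directed.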

A problem arises, however, in the following case:

  • A process has at least two threads,
  • One thread "A" receives a SIGSTOP,
  • And the other thread "B" is in a blocking system call,
  • And then a process-directed SIGCONT is sent to the process,
  • And the scheduler checks if "B" is runnable before checking if "A" is runnable.

In this case, the issue is that the process-directed SIGCONT will set the bit in ShdPnd for both threads. So t->is_signal_pending(SIGCONT) will be true for both thread A and B. The scheduler then tries to PTRACE_CONT thread B, but it's not actually in a ptrace-stop, so it all goes pear-shaped (actually you get an assertion failure in t->resume_execution()).

The fix is to skip this SigPnd/ShdPnd check entirely for threads that are not actually in a ptrace-stop. They don't need this kind of special handling, because they're not ptrace-stopped; when we go to try_wait on them later on, we'll notice that they received a signal, and the handling in RecordTask::signal_delivered will actually run emulate_SIGCONT then.
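The effect of the guard can be sketched with a deliberately simplified model. Everything here (`ThreadModel`, `should_cont_before_fix`, `should_cont_after_fix`) is hypothetical and only mirrors the description in this PR; it is not rr's scheduler code, and it assumes the pending mask has already been read from /proc.

```cpp
#include <csignal>
#include <cstdint>

// Hypothetical stand-in for a traced thread; not rr's RecordTask.
struct ThreadModel {
  bool in_ptrace_stop;    // true only if the tracer has observed a stop
  uint64_t pending_mask;  // SigPnd | ShdPnd, with signal N at bit N-1
};

bool has_pending(const ThreadModel& t, int sig) {
  return (t.pending_mask & (uint64_t{1} << (sig - 1))) != 0;
}

// Behaviour as described before the fix: a pending SIGCONT alone makes the
// scheduler try to PTRACE_CONT the thread.
bool should_cont_before_fix(const ThreadModel& t) {
  return has_pending(t, SIGCONT);
}

// Behaviour as described after the fix: the pending check is consulted only
// for threads that really are in a ptrace-stop.
bool should_cont_after_fix(const ThreadModel& t) {
  return t.in_ptrace_stop && has_pending(t, SIGCONT);
}
```

In this model, thread A (ptrace-stopped) and thread B (blocked in a syscall) share the same pending mask after a process-directed SIGCONT. The unguarded predicate selects both, while the guarded one selects only A and leaves B to be handled by the normal wait path.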

Fixes #3871

@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_sigcont_threaded branch from 0bc299a to 6862f7c Compare November 9, 2024 04:04
@rocallahan rocallahan merged commit b1e461a into rr-debugger:master Nov 9, 2024
5 checks passed
@rocallahan

Thanks! This is very tricky stuff.

Linked issue: "Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program