-
-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segmentation fault when assigning Timer #4369
Comments
I am able to replicate. The commit ( 6806b6b ) where we switched to LLVM 15 introduces the bug. Not it only happens in debug mode, to release mode. |
Here's some wonderful LLDB output done with a debug version of the runtime:
It doesn't make much sense. I'm not sure what is is saying it is segfaulting on as chunk isn't null. and isn't address 0x0.
|
So my suspicion based on what I can see... this works in release mode because "the bad code" is being optimized away in release mode. Specifically, I assume that the heap allocation is optimized away by the |
That suspicion appears to be incorrect as I built a version of the compiler with |
Ok, I found a change that fixes the issue, but it is a symptom not a cure. having that specifically activates this code: if(c->opt->strip_debug) {
PB.registerOptimizerLastEPCallback(
[&](ModulePassManager &mpm, OptimizationLevel level) {
mpm.addPass(StripSymbolsPass());
}
);
} which appears to be "fixing" the crash. I suspect that we probably are generating some bad LLVM that wasn't problematic prior to LLVM 15 or was introduced during our shift to LLVM 15. |
The LLDB output posted in Sean's comment above is wacky in a few ways:
Maybe the debug table is being corrupted somehow? |
Sean and I will discuss this more on Friday and maybe do some LLVM IR spelunking together. |
Minimal repro so far: use "collections"
actor Main
new create(env: Env) =>
var i: USize = 0
while i < 100000 do
let t1 = Timer
OurTimers(consume t1)
let t2 = Timer
OurTimers(consume t2)
i = i + 1
end
actor OurTimers
new create() =>
None
be apply(timer: Timer iso) =>
None
class Timer
embed _node: ListNode[Timer]
new iso create() =>
_node = ListNode[Timer]
try _node()? = this end |
Final minimal repro: actor Main
new create(env: Env) =>
var i: USize = 0
while i < 100000 do
let t1 = Timer
OurTimers(consume t1)
let t2 = Timer
OurTimers(consume t2)
i = i + 1
end
actor OurTimers
new create() =>
None
be apply(timer: Timer iso) =>
None
class Timer
embed _node: TimerListNode
new iso create() =>
_node = TimerListNode
try _node()? = this end
class TimerListNode
var _item: (Timer| None)
new create(item: (Timer | None) = None) =>
_item = item
fun ref update(value: Timer): Timer^ ? =>
error |
This is the same issue as #3874 except that it is also appearing on X86 not just aarch64. |
Further refinement: actor Main
new create(env: Env) =>
var i: USize = 0
while i < 1_000_000 do
let t1 = Timer
let t2 = Timer
OurTimers(consume t1)
i = i + 1
end
actor OurTimers
new create() =>
None
be apply(timer: Timer iso) =>
None
class Timer
embed _node: TimerListNode
new iso create() =>
_node = TimerListNode
try _node()? end
class TimerListNode
fun apply() ? =>
error I'm no longer sure it is the same issue as #3874 but it has the same "slight change to address". With 3874 this was reproducible every time. This often works just fine on x86 and we need to do it often to get the issue and even then, its not guaranteed to happen. |
Here's a version that seems to be fully reproducible on my machine without looping which could make it #3874 again. actor Main
new create(env: Env) =>
let t1 = Timer
let t2 = Timer
OurTimers(consume t1)
actor OurTimers
new create() =>
None
be apply(timer: Timer iso) =>
None
class Timer
embed _node: TimerListNode
new iso create() =>
_node = TimerListNode
try _node()? end
class TimerListNode
fun apply() ? =>
error |
@ponylang/committer can you test on your machines and see what happens with that program using ponyc 0.55.0 and report if it works or crashes and what your particular setup is in either case? |
I've discovered that our minimal repros do not always crash, in fact, I cant make most of my minimal reproductions crash if I build ponyc myself but I can with the original so let's go with the original. |
I cant reproduce this in CI except for the musl x86 builds. So very weird as I am reproducing locally on x86 glibc with ubuntu 22.04. |
Results with minimal repo of Aug 5, 2023, 1:15 AM GMT+1 Windows 10 Pro Version 22H2 ponyc 0.54.1 [release] ponyc 0.55.0 [release] Fedora Linux 38 (Workstation Edition) ponyc 0.54.1-c41393ce [release] ponyc 0.55.0-2dc8d01c [release] Ubuntu 23.04 ponyc 0.55.0-6c9b027a [release] |
@jemc what are the steps I should take to try using the "llvm tools" to link the program to see if it fixes the crash for me? |
I tried this with the x86 limux glibc environment being on GH now... still can't get a reproduction from |
@NickJT can you try retesting with |
When setting up MacOS x86 CI, I found that the |
The problem appears to be localized to using release versions of the runtime and then doing a debug compile of a pony program. Stripping debug symbols fixes the problem. Hypothesis: the debug info in the binary we generate with |
StripDebugInfo(M);
StripSymbolNames(M, false);
|
The interesting question here is why does changing the optimization level also fix the issue. Does it strip debug info? |
Not running |
Turning off |
and in there, removing if (EnableModuleInliner)
MPM.addPass(buildModuleInlinerPipeline(Level, Phase));
else
MPM.addPass(buildInlinerPipeline(Level, Phase)); fixes our problem. note that both branches appear that they would cause us issues, but it appears only the |
Turning on LTO fixes the issue for me for one of the smaller reproductions but doesn't fix it for the original. |
All debugging I did this evening was with: use @pony_exitcode[None](code: I32)
actor Main
new create(env: Env) =>
let t1 = Timer
let t2 = Timer
OurTimers(consume t1)
@pony_exitcode(0)
actor OurTimers
new create() =>
None
be apply(timer: Timer iso) =>
None
class Timer
embed _node: TimerListNode
new iso create() =>
_node = TimerListNode
try _node()? end
class TimerListNode
fun apply() ? =>
error except for a last test for LTO where I used the original code when the issue was opened. |
Turning off if (EnableModuleInliner)
MPM.addPass(buildModuleInlinerPipeline(Level, Phase));
else
MPM.addPass(buildInlinerPipeline(Level, Phase)); also fixes the original version. |
Using a debug runtime with |
Results with minimal repo of Aug 5, 2023, 1:15 AM GMT+1 Fedora Linux 38 (Workstation Edition) ponyc 0.55.0-2dc8d01c [release] ponyc --debug : segfault Ubuntu 23.04 ponyc 0.55.0-6c9b027a [release] ponyc --debug : segfault |
I found what in the switch to LLVM 15 caused this issue to crop up. I still have a few things to discuss with Joe, but, we changed the optimizations run for I can create this issue with LLVM 14 by having inliner passes for optimization levels above -O0 run. |
When we upgraded to LLVM 15, we accidentally changed our optimization level for programs compiled with --debug. This uncovered a pre-existing condition and caused issues like #4369. This takes us back to LLVM 14 level but doesn't "fix" the underlying issue which we are continuing to work on.
When we upgraded to LLVM 15, we accidentally changed our optimization level for programs compiled with --debug. This uncovered a pre-existing condition and caused issues like #4369. This takes us back to LLVM 14 level but doesn't "fix" the underlying issue which we are continuing to work on.
When we upgraded to LLVM 15, we accidentally changed our optimization level for programs compiled with --debug. This uncovered a pre-existing condition and caused issues like #4369. This takes us back to LLVM 14 level but doesn't "fix" the underlying issue which we are continuing to work on.
When we upgraded to LLVM 15, we accidentally changed our optimization level for programs compiled with --debug. This uncovered a pre-existing condition and caused issues like #4369. This takes us back to LLVM 14 level but doesn't "fix" the underlying issue which we are continuing to work on.
This is "fixed" in 0.55.1 but the underlying issue remains. #4401 has a "fix" that comes closer to addressing the underlying issue (but at the moment is just an investigation). |
@NickJT did you build those |
Yes, both instances of ponyc were built from source
…On Fri, 25 Aug 2023, 23:20 Sean T Allen, ***@***.***> wrote:
Results with minimal repo of Aug 5, 2023, 1:15 AM GMT+1
*Fedora Linux 38 (Workstation Edition)* Intel® Core™ i5-6500T 16.00GB
ponyc 0.55.0-2dc8d01c [release] Compiled with: LLVM 15.0.7 --
Clang-16.0.6-x86_64 Defaults: pic=true
ponyc --debug : segfault ponyc --debug --strip : runs to completion
*Ubuntu 23.04* Intel® Core™ i7-4770 16.00GB
ponyc 0.55.0-6c9b027a [release] Compiled with: LLVM 15.0.7 --
Clang-15.0.7-x86_64 Defaults: pic=true
ponyc --debug : segfault ponyc --debug --strip : runs to completion
@NickJT <https://github.com/NickJT> did you build those that you used to
test yourself?
—
Reply to this email directly, view it on GitHub
<#4369 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB2DWSTIGZB72KHEBCIU753XXEQLZANCNFSM6AAAAAA272HZ2Q>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@NickJT did you do a release or debug configuration when building the compiler? |
It was a release build in both cases (following the instructions here:
https://github.com/ponylang/ponyc/blob/main/BUILD.md)
…On Sat, 26 Aug 2023, 12:30 Sean T Allen, ***@***.***> wrote:
@NickJT <https://github.com/NickJT> did you do a release or debug
configuration when building the compiler?
—
Reply to this email directly, view it on GitHub
<#4369 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB2DWSW7A4F3XMGNURNLK7LXXHM6VANCNFSM6AAAAAA272HZ2Q>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@NickJT can you update to commit 3bafb91 and rebuild and verify |
I've learned a lot about weirdness we are seeing around this. For example, builds starting with the August 9 nightly do not segfault. This corresponds with switching from Cirrus to GH and a new builder image for releases. However that isn't what appears to "fix" the issue when using a release runtime... 277329e changes the nature of the boom so on glibc Linux it appears to only happen with debug runtime versions. Another weirdness was "why is this passing for CI with the regression". That turns out to be a bug that needs to be investigated in our running of the regression-4369 test on glibc linux. It passes CI when using a debugger, but fails otherwise (assuming you are on a branch that 'unfixes' the fix we slapped onto main while investigating further. See https://ponylang.zulipchat.com/#narrow/stream/190359-ci/topic/runner.20on.20x86.20glibc.20CI So in a bit of "sanity", this appears to now be back to run of the mill, "wtf" optimization insanity related in some fashion to some debug info. You know, really hard, but not completely off the wall. Investigation continues. |
I hope this is what you asked me to test, but commit 3bafb91 gives a segfault with -d (though not with -d --strip). I've included the relevant sections of the terminal log below (just in case I missed something). With the latest commit (66c74e6) the minimal example runs to completion just with -d. Fedora Linux 38 (Workstation Edition) update to commit 3bafb91 [nick@pete ponyc]$ git checkout 3bafb91 configure/build/install... [nick@pete ponyc]$ ponyc -v with latest commit [nick@pete ponyc]$ git checkout 66c74e6 [configure/build/install...] [nick@pete ponyc]$ ponyc -v |
While looking into at a related issue, I ran the regression test for this under
Given the noise, I'm not sure how much stock to put into this, however, the It is possible something we are doing makes backtracing in gdb not work, but, it is worth someone looking into. Note that
|
After extensive investigation of the weirdness that is #4369, Joe and I believe the issue is most likely an LLVM bug. However, we don't have time to pursue further. During the investigation, I came to believe that the change to use CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations would be prudent. It prevents the known instances of bugs like #4369 which is nice but not key to this change. Changes things because they make symptoms go away isn't a good strategy if you don't understand why. We do not "fully" understand the impact of this change on bugs like #4369. However, it appears that CodeGenOpt::None is a code path that sees little use or testing and is more prone to bugs/regression. Given that we have seen more than 1 bug that "smells like" #4369, we felt it was wise to switch to Default from None at least until such time as we understand #4369 and #3874. LLVM is a complicated beast and using code paths that are little used can be a dangerous proposition. This commit doesn't include release notes as we addresses to user facing instance of #4369 already.
After extensive investigation of the weirdness that is #4369, Joe and I believe the issue is most likely an LLVM bug. However, we don't have time to pursue further. During the investigation, I came to believe that the change to use CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations would be prudent. It prevents the known instances of bugs like #4369 which is nice but not key to this change. Changes things because they make symptoms go away isn't a good strategy if you don't understand why. We do not "fully" understand the impact of this change on bugs like #4369. However, it appears that CodeGenOpt::None is a code path that sees little use or testing and is more prone to bugs/regression. Given that we have seen more than 1 bug that "smells like" #4369, we felt it was wise to switch to Default from None at least until such time as we understand #4369 and #3874. LLVM is a complicated beast and using code paths that are little used can be a dangerous proposition. This commit doesn't include release notes as we addresses to user facing instance of #4369 already.
Running the following code after compiling in debug mode results in a segfault. Running the same code after compiling in release mode works as expected.
Environment:
Ponyc version:
Zulip topic: https://ponylang.zulipchat.com/#narrow/stream/189985-beginner-help/topic/segfault.20with.20-d
Original code below copied from https://github.com/ponylang/ponyc/blob/main/examples/timers/timers.pony
The text was updated successfully, but these errors were encountered: