iHP DRC causes segfaults randomly #1907

Closed
smunaut opened this issue Oct 21, 2024 · 30 comments · Fixed by #1910

@smunaut

smunaut commented Oct 21, 2024

When running DRC using the iHP deck, I'm getting random segfaults. This is using 0.29.7 built from source in a nix environment.

I've attached a test case. However, the failure is random: one run might work 100% fine, the next could crash, and sometimes it reports non-existent DRC errors.

The package includes:

  • The DRC deck used, the input file, and the command line used to start it
  • An example of a report with non-existent DRC errors
  • The crash log from a run where it segfaulted

drc_tst.zip

@klayoutmatthias
Collaborator

Hmm ... thanks for this report.

However, I can't see the crashes on Ubuntu 24 with the official package from the download page. I also don't see anything suspicious in valgrind.

From the trace, the issue can be anything. A rough guess is a problem with the Ruby integration. Is it safe to assume the build is correctly linked against the same Ruby library that is used at runtime?
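One way to approach the linking question is to ask the running interpreter what it thinks it is. The following is a minimal sketch, assuming the `rbconfig` standard library can be loaded from the interpreter in question; `ruby_runtime_info` is a hypothetical helper name, and the values it prints still have to be compared by hand against the library the binary was linked with.

```ruby
require 'rbconfig'

# Hypothetical helper: collects what the running Ruby interpreter reports
# about itself. It does not inspect the KLayout binary.
def ruby_runtime_info
  {
    version: RUBY_VERSION,                 # version of the running interpreter
    platform: RUBY_PLATFORM,               # e.g. the target triple of the build
    libruby: RbConfig::CONFIG['LIBRUBY'],  # file name of the Ruby library
    libdir: RbConfig::CONFIG['libdir']     # directory that library was installed to
  }
end

ruby_runtime_info.each { |key, value| puts "#{key}: #{value}" }
```

If the reported version or library differs from what the build was configured with, that would point toward a mismatch; matching values do not rule out other causes.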

I am not familiar with nix. I am a plain old distro user. I have not tried builds with nix, and honestly that is not on my top priority list.

Is anyone else able to reproduce the issue?

Matthias

@smunaut
Author

smunaut commented Oct 21, 2024

TBH I'm not all that familiar with nix either :/ But it's what's used for all OpenLane2 runs so it's everywhere ...

The issue showed up both in the github actions and on my test VM. They both use the same "nix derivation" (sort of "source package") for klayout but they were built independently on different hardware and showed the same issue.

I also can't reproduce it on my laptop (using the exact same git hash) that runs klayout natively.

@stefanottili

stefanottili commented Oct 21, 2024

On the first attempt, it core-dumped several times out of 10 runs on a release build of KLayout on M1 macOS Sequoia using Homebrew.
The next 3x10 runs finished without errors, but on the 5th attempt I got core dumps again.
Very random ...

ERROR: Signal number: 11
Address: 0x8a87eb3bd698
Program Version: KLayout 0.29.8 (2024-10-20 r22f1778a4)

Backtrace:
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_lay.0.29.8.dylib +0x1e6874 _ZN3lay25enable_signal_handler_guiEb
/usr/lib/system/libsystem_platform.dylib +0x4184 _sigtramp
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_db.0.29.8.dylib +0x141954 _ZN2db9Instances9do_insertERKNS_8InstanceERN2tl18func_delegate_baseIjEERNS5_ImEE
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_db.0.29.8.dylib +0x141954 _ZN2db9Instances9do_insertERKNS_8InstanceERN2tl18func_delegate_baseIjEERNS5_ImEE
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_db.0.29.8.dylib +0x132c5c _ZN2db17instance_iteratorINS_30TouchingInstanceIteratorTraitsEEppEv
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_db.0.29.8.dylib +0x1328c4 _ZN2db17instance_iteratorINS_30TouchingInstanceIteratorTraitsEEppEv
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_db.0.29.8.dylib +0x266f54 _ZNK2db22RecursiveShapeIterator10next_shapeEPNS_22RecursiveShapeReceiverE
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_db.0.29.8.dylib +0x10a596c _ZNK2db13DeepEdgePairs23insert_into_as_polygonsEPNS_6LayoutEjji
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_rba.0.29.8.dylib +0x99a8 _ZNK3rba22RubyStackTraceProvider11stack_depthEv
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1d429c vm_call_cfunc_with_frame_
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1be354 vm_exec_core
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1bbbd4 rb_vm_exec
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1c8f10 specific_eval
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1d429c vm_call_cfunc_with_frame_
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1be5c8 vm_exec_core
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1bbb24 rb_vm_exec
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1c6fb8 rb_funcallv_scope
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x84ce0 rb_protect
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_rba.0.29.8.dylib +0x26df4 _ZN3rba15RubyInterpreter19remove_exec_handlerEPN3gsi16ExecutionHandlerE
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_rba.0.29.8.dylib +0x28458 _ZN3rba15RubyInterpreter19remove_exec_handlerEPN3gsi16ExecutionHandlerE
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_gsi.0.29.8.dylib +0x492d4 _ZNK2tl6Recipe11descriptionEv
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_tl.0.29.8.dylib +0x75220 _ZN2tl10Executable10do_executeEv
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_lym.0.29.8.dylib +0x29284 _ZN3lym16MacroInterpreter13execute_macroEPKNS_5MacroE
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_lym.0.29.8.dylib +0x36a34 _ZNK3lym5Macro3runEv
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_lay.0.29.8.dylib +0x6b03c _ZN3lay15ApplicationBase3runEv
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/klayout.app/Contents/MacOS/klayout +0xd904
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_rba.0.29.8.dylib +0x17e4c _ZN3rba15RubyInterpreter10initializeERiPPcPFiS1_S3_E
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1d429c vm_call_cfunc_with_frame_
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1be5c8 vm_exec_core
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x1bbb24 rb_vm_exec
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x841bc rb_ec_exec_node
/opt/homebrew/Cellar/ruby/3.3.5/lib/libruby.3.3.dylib +0x84098 ruby_run_node
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/libklayout_rba.0.29.8.dylib +0x17dd8 _ZN3rba15RubyInterpreter10initializeERiPPcPFiS1_S3_E
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/klayout.app/Contents/MacOS/klayout +0xd2d0
klayout/qt5Brew.bin.macos-Sequoia-release-Rhb33Phbauto/klayout.app/Contents/MacOS/klayout +0xcc34
/usr/lib/dyld +0x6274 start

@stefanottili

When it occurs, the core dump is triggered by "Rule pSD.d".

@smunaut
Author

smunaut commented Oct 22, 2024

Yes, here when it doesn't crash but reports wrong DRC errors, they are also always in the pSD.d rule.

@klayoutmatthias
Collaborator

I will try to reproduce the issue with nix. Do you have some basic build instructions for me?

Thanks, Matthias

@Kazzz-S
Contributor

Kazzz-S commented Oct 22, 2024

Hello folks,

The table below shows another test result set on an Intel Mac using different DMGs (#1871).
Everything seems to be okay in these cases.

| Srl.No. | Target OS | DMG file name | Qt5 or Qt6 | Ruby | Python | Number of Runs | Number of Crashes |
|---|---|---|---|---|---|---|---|
| 10 | Sonoma (14.7) | ST-klayout-0.29.7-macOS-Sonoma-1-qt5MP-RsysPsys.dmg | MacPorts | OS-bundled | OS-bundled | 10 | 0 |
| 11 | Sonoma (14.7) with MacPorts dev. env. | LW-klayout-0.29.7-macOS-Sonoma-1-qt5MP-Rmp33Pmp312.dmg | MacPorts | MacPorts | MacPorts | 10 | 0 |
| 12 | Sonoma (14.7) with Homebrew dev. env. | LW-klayout-0.29.7-macOS-Sonoma-1-qt6Brew-Rhb33Phb312.dmg | Homebrew | Homebrew | Homebrew | 10 | 0 |
| 13 | Sonoma (14.7) with Anaconda3 dev. env. | LW-klayout-0.29.7-macOS-Sonoma-1-qt5Ana3-Rana3Pana3.dmg | Anaconda3 | Anaconda3 | Anaconda3 | 10 | 0 |
| 14 | Sonoma (14.7) | HW-klayout-0.29.7-macOS-Sonoma-1-qt5MP-RsysPhb311.dmg | MacPorts | OS-bundled | Homebrew | 10 | 0 |

Kazzz-S

@smunaut
Author

smunaut commented Oct 22, 2024

@klayoutmatthias

The easiest way is to follow the instructions to install OpenLane 2:
https://openlane2.readthedocs.io/en/latest/getting_started/common/nix_installation/index.html

except instead of using the official repo, use the tt-ihp branch of our fork : https://github.com/TinyTapeout/openlane2/tree/tt-ihp

Then when nix is installed and you're in the OL2 directory just typing nix-shell should get you into the nix environment with the klayout that's exhibiting issues.

@stefanottili

stefanottili commented Oct 22, 2024

@Kazzz-S 10 runs might not be enough, I had 2 fails in the first 10, then none in the next 30 before a couple in the next 10.

@smunaut
Author

smunaut commented Oct 22, 2024

Also note that ATM in my latest build I can't get it to segfault anymore ... but it reports random non-existing DRC errors half the time. So this is really weird.

I also tried removing the pSD.d rule and then it works fine.

@smunaut
Author

smunaut commented Oct 22, 2024

As another data point, we're also observing erroneous DRC errors of the pSD.d rule in a github action that doesn't use nix and just uses the ubuntu 24.04 package from the website :

https://github.com/TinyTapeout/tt-gds-action/blob/main/orfs/action.yml#L72

No crashes, just wrong DRC results.

@smunaut
Author

smunaut commented Oct 22, 2024

Here is a Valgrind log that shows access to a freed block while iterating with db::DeepEdgePairsIterator, which correlates with the crash log above too.

val.log

@klayoutmatthias
Collaborator

Thanks for the valgrind log. Most likely it is a modify-while-iterating issue. But it is hard to pinpoint as the code is somewhat complex.

The modification is not direct; the problem is a recomputation of the instance quad tree. Basically, the iterators involved should lock the layout to prevent this effect.
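The locking idea can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyLayout`, `each_locked`, and the lock counter are hypothetical names and not KLayout's API, and a sorted array stands in for the instance quad tree. While an iteration holds a lock, a requested index rebuild is deferred until the iteration ends.

```ruby
# Toy model only: hypothetical classes, not KLayout's implementation.
class ToyLayout
  attr_reader :rebuilds

  def initialize(items)
    @items = items.dup
    @index = @items.sort   # stands in for the sorted instance tree
    @dirty = false
    @locks = 0
    @rebuilds = 0
  end

  # Adding an item marks the index as stale.
  def add(item)
    @items << item
    @dirty = true
  end

  # Rebuild the index, unless an iteration currently holds a lock.
  def update
    return if @locks > 0 || !@dirty
    @index = @items.sort
    @dirty = false
    @rebuilds += 1
  end

  # Iterate over a stable index; rebuilds requested meanwhile are deferred.
  def each_locked
    update
    @locks += 1
    begin
      @index.each { |item| yield item }
    ensure
      @locks -= 1
      update
    end
  end
end

layout = ToyLayout.new([3, 1, 2])
seen = []
layout.each_locked do |item|
  seen << item
  layout.add(item + 10) if item == 1   # modification during iteration
  layout.update                         # no effect while the lock is held
end
```

In this model the loop visits only the three original items, and the single rebuild happens after the iteration has finished. In C++ the stakes are higher than in this Ruby toy, because a rebuild during iteration frees the nodes the iterator still points to.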

@smunaut
Author

smunaut commented Oct 22, 2024

I haven't confirmed this yet, but I'm starting to wonder whether this is clang vs. gcc related.
All the places where it works without issues are built with gcc, and both the nix and macOS builds use clang.

@stefanottili

stefanottili commented Oct 22, 2024

Thanks to Kazzz's build scripts, running on an M1 Mac with asan is as easy as adding -d to the build command.
It shows heap-use-after-free errors.

Apple clang version 16.0.0 (clang-1600.0.26.3)
Target: arm64-apple-darwin24.0.0

klayout_crash.log

@klayoutmatthias
Collaborator

@stefanottili the log is consistent with the valgrind log.

I am using gcc as of now. I will try a build with clang and see if I can reproduce the issue.

Matthias

@smunaut
Author

smunaut commented Oct 22, 2024

Never mind, I built using clang natively and couldn't reproduce the issue.
I also built in nix with gcc and had a crash ...

@stefanottili

here's the asan output

Rule pSD.c: 0 error(s)
=================================================================
==6379==ERROR: AddressSanitizer: heap-use-after-free on address 0x60700016d218 at pc 0x00011509937c bp 0x00016b12cb20 sp 0x00016b12cb18
READ of size 8 at 0x60700016d218 thread T0
    #0 0x115099378 in db::box_tree_node<db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>>::lenq(int) const dbBoxTree.h:239
    #1 0x115099110 in db::unstable_box_tree_it<db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>, db::box_tree_sel<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, db::boxes_touch<db::box<int, int>>>>::inc() dbBoxTree.h:1685
    #2 0x11505d634 in db::unstable_box_tree_it<db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>, db::box_tree_sel<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, db::boxes_touch<db::box<int, int>>>>::operator++() dbBoxTree.h:1485
    #3 0x11505d4d0 in db::instance_iterator<db::TouchingInstanceIteratorTraits>::operator++() dbInstances.cc:467
    #4 0x115a69bf0 in db::RecursiveShapeIterator::next_shape(db::RecursiveShapeReceiver*) const dbRecursiveShapeIterator.cc:746
    #5 0x115a6c674 in db::RecursiveShapeIterator::next(db::RecursiveShapeReceiver*) dbRecursiveShapeIterator.cc:688
    #6 0x114783f3c in db::RecursiveShapeIterator::operator++() dbRecursiveShapeIterator.h:763
    #7 0x11a5f26c0 in db::DeepEdgePairsIterator::increment() dbDeepEdgePairs.cc:64
    #8 0x1145e3c7c in db::generic_shape_iterator<db::edge_pair<int>>::operator++() dbGenericShapeIterator.h:369
    #9 0x117a20144 in gsi::FreeIterAdaptor<db::generic_shape_iterator<db::edge_pair<int>>>::inc() gsiIterators.h:261
    #10 0x108c186d4 in rba::method_adaptor(int, int, unsigned long*, unsigned long, bool) rba.cc:1438
    #11 0x108c1c824 in rba::method_adaptor_n(int, int, unsigned long*, unsigned long, bool) rba.cc:1464
    #12 0x108c1dba4 in unsigned long rba::method_adaptor<77>(int, unsigned long*, unsigned long) rba.cc:1470
    #13 0x105fe8298 in vm_call_cfunc_with_frame_+0xe8 (libruby.3.3.dylib:arm64+0x1d4298)
    #14 0x105fd2350 in vm_exec_core+0x17dc (libruby.3.3.dylib:arm64+0x1be350)
    #15 0x105fcfbd0 in rb_vm_exec+0x268 (libruby.3.3.dylib:arm64+0x1bbbd0)
    #16 0x105fdcf0c in specific_eval+0x12c (libruby.3.3.dylib:arm64+0x1c8f0c)
    #17 0x105fe8298 in vm_call_cfunc_with_frame_+0xe8 (libruby.3.3.dylib:arm64+0x1d4298)
    #18 0x105fd25c4 in vm_exec_core+0x1a50 (libruby.3.3.dylib:arm64+0x1be5c4)
    #19 0x105fcfb20 in rb_vm_exec+0x1b8 (libruby.3.3.dylib:arm64+0x1bbb20)
    #20 0x105fdafb4 in rb_funcallv_scope+0x208 (libruby.3.3.dylib:arm64+0x1c6fb4)
    #21 0x108d1df50 in rba::rb_funcall2_wrap(unsigned long) rbaUtils.cc:425
    #22 0x105e98cdc in rb_protect+0xd4 (libruby.3.3.dylib:arm64+0x84cdc)
    #23 0x108d1dad4 in rba::rba_funcall2_checked(unsigned long, unsigned long, int, unsigned long*) rbaUtils.cc:454
    #24 0x108d288c0 in rba::Proxy::call(int, gsi::SerialArgs&, gsi::SerialArgs&) const rbaInternal.cc:256
    #25 0x107268624 in gsi::Callback::call_int(gsi::SerialArgs&, gsi::SerialArgs&) const gsiCallback.h:70
    #26 0x107268178 in tl::Variant gsi::Callback::issue<tl::Executable, tl::Variant>(tl::Variant (tl::Executable::*)()) const gsiCallbackVar.h:55
    #27 0x1072674f4 in gsi::Executable_Impl::execute() gsiDeclTl.cc:626
    #28 0x106c8d980 in tl::Executable::do_execute() tlRecipe.cc:52
    #29 0x105a1a534 in lym::MacroInterpreter::execute_macro(lym::Macro const*) lymMacroInterpreter.cc:135
    #30 0x105a7229c in lym::Macro::run() const lymMacro.cc:1095
    #31 0x10d55f940 in lay::ApplicationBase::run() layApplication.cc:1338
    #32 0x104cda454 in klayout_main_cont(int&, char**) klayout.cc:410
    #33 0x108c3ec0c in rba::run_app_func(unsigned long) rba.cc:2246
    #34 0x105fe8298 in vm_call_cfunc_with_frame_+0xe8 (libruby.3.3.dylib:arm64+0x1d4298)
    #35 0x105fd25c4 in vm_exec_core+0x1a50 (libruby.3.3.dylib:arm64+0x1be5c4)
    #36 0x105fcfb20 in rb_vm_exec+0x1b8 (libruby.3.3.dylib:arm64+0x1bbb20)
    #37 0x105e981b8 in rb_ec_exec_node+0xb8 (libruby.3.3.dylib:arm64+0x841b8)
    #38 0x105e98094 in ruby_run_node+0x5c (libruby.3.3.dylib:arm64+0x84094)
    #39 0x108c3e9dc in rba::RubyInterpreter::initialize(int&, char**, int (*)(int&, char**)) rba.cc:2372
    #40 0x104cd8568 in klayout_main(int&, char**) klayout.cc:335
    #41 0x104cd7248 in main klayout.cc:148
    #42 0x188e3c270  (<unknown module>)

0x60700016d218 is located 8 bytes inside of 72-byte region [0x60700016d210,0x60700016d258)
freed by thread T0 here:
    #0 0x1119442d4 in _ZdlPv+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x642d4)
    #1 0x11516babc in db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>::sort(db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false> const&, db::complex_bbox_tag const&) dbBoxTree.h:2196
    #2 0x11508aa88 in db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>::sort(db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false> const&) dbBoxTree.h:2003
    #3 0x11508a4a8 in db::Instances::sort_inst_tree(db::Layout const*, bool) dbInstances.cc:1441
    #4 0x1141f374c in db::Cell::sort_inst_tree(bool) dbCell.cc:750
    #5 0x1151deed8 in db::Layout::do_update() dbLayout.cc:1886
    #6 0x1154c0770 in db::LayoutStateModel::update() dbLayoutStateModel.cc:101
    #7 0x1151cbac4 in db::Layout::update() const dbLayout.cc:1786
    #8 0x1141ebb4c in db::Cell::bbox(unsigned int) const dbCell.cc:434
    #9 0x115a69384 in db::RecursiveShapeIterator::next_shape(db::RecursiveShapeReceiver*) const dbRecursiveShapeIterator.cc:721
    #10 0x115a6c674 in db::RecursiveShapeIterator::next(db::RecursiveShapeReceiver*) dbRecursiveShapeIterator.cc:688
    #11 0x114783f3c in db::RecursiveShapeIterator::operator++() dbRecursiveShapeIterator.h:763
    #12 0x11a5f26c0 in db::DeepEdgePairsIterator::increment() dbDeepEdgePairs.cc:64
    #13 0x1145e3c7c in db::generic_shape_iterator<db::edge_pair<int>>::operator++() dbGenericShapeIterator.h:369
    #14 0x117a20144 in gsi::FreeIterAdaptor<db::generic_shape_iterator<db::edge_pair<int>>>::inc() gsiIterators.h:261
    #15 0x108c186d4 in rba::method_adaptor(int, int, unsigned long*, unsigned long, bool) rba.cc:1438
    #16 0x108c1c824 in rba::method_adaptor_n(int, int, unsigned long*, unsigned long, bool) rba.cc:1464
    #17 0x108c1dba4 in unsigned long rba::method_adaptor<77>(int, unsigned long*, unsigned long) rba.cc:1470
    #18 0x105fe8298 in vm_call_cfunc_with_frame_+0xe8 (libruby.3.3.dylib:arm64+0x1d4298)
    #19 0x105fd2350 in vm_exec_core+0x17dc (libruby.3.3.dylib:arm64+0x1be350)
    #20 0x105fcfbd0 in rb_vm_exec+0x268 (libruby.3.3.dylib:arm64+0x1bbbd0)
    #21 0x105fdcf0c in specific_eval+0x12c (libruby.3.3.dylib:arm64+0x1c8f0c)
    #22 0x105fe8298 in vm_call_cfunc_with_frame_+0xe8 (libruby.3.3.dylib:arm64+0x1d4298)
    #23 0x105fd25c4 in vm_exec_core+0x1a50 (libruby.3.3.dylib:arm64+0x1be5c4)
    #24 0x105fcfb20 in rb_vm_exec+0x1b8 (libruby.3.3.dylib:arm64+0x1bbb20)
    #25 0x105fdafb4 in rb_funcallv_scope+0x208 (libruby.3.3.dylib:arm64+0x1c6fb4)
    #26 0x108d1df50 in rba::rb_funcall2_wrap(unsigned long) rbaUtils.cc:425
    #27 0x105e98cdc in rb_protect+0xd4 (libruby.3.3.dylib:arm64+0x84cdc)
    #28 0x108d1dad4 in rba::rba_funcall2_checked(unsigned long, unsigned long, int, unsigned long*) rbaUtils.cc:454
    #29 0x108d288c0 in rba::Proxy::call(int, gsi::SerialArgs&, gsi::SerialArgs&) const rbaInternal.cc:256

previously allocated by thread T0 here:
    #0 0x111943e94 in _Znwm+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x63e94)
    #1 0x11516cfec in void db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>::tree_sort<db::box_tree_cached_picker<db::array<db::CellInst, db::simple_trans<int>>, db::box<int, int>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, tl::vector<db::array<db::CellInst, db::simple_trans<int>>>>>(db::box_tree_node<db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>>*, std::__1::__wrap_iter<db::array<db::CellInst, db::simple_trans<int>>*>, std::__1::__wrap_iter<db::array<db::CellInst, db::simple_trans<int>>*>, db::box_tree_cached_picker<db::array<db::CellInst, db::simple_trans<int>>, db::box<int, int>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, tl::vector<db::array<db::CellInst, db::simple_trans<int>>>>&, db::box<int, int> const&, int) dbBoxTree.h:2276
    #2 0x11516bc40 in db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>::sort(db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false> const&, db::complex_bbox_tag const&) dbBoxTree.h:2200
    #3 0x11508aa88 in db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>::sort(db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false> const&) dbBoxTree.h:2003
    #4 0x11508a4a8 in db::Instances::sort_inst_tree(db::Layout const*, bool) dbInstances.cc:1441
    #5 0x1141f374c in db::Cell::sort_inst_tree(bool) dbCell.cc:750
    #6 0x1151deed8 in db::Layout::do_update() dbLayout.cc:1886
    #7 0x1154c0770 in db::LayoutStateModel::update() dbLayoutStateModel.cc:101
    #8 0x1151cbac4 in db::Layout::update() const dbLayout.cc:1786
    #9 0x114335618 in db::Layout::begin_top_down() const dbLayout.h:1611
    #10 0x11a5e373c in db::DeepEdgePairs::begin_iter() const dbDeepEdgePairs.cc:258
    #11 0x11a5e3220 in db::DeepEdgePairs::begin() const dbDeepEdgePairs.cc:246
    #12 0x1145dffc0 in db::EdgePairs::begin() const dbEdgePairs.h:238
    #13 0x117a1f098 in gsi::ConstMethodFreeIter0<db::EdgePairs, db::generic_shape_iterator<db::edge_pair<int>>, gsi::arg_default_return_value_preference>::call(void*, gsi::SerialArgs&, gsi::SerialArgs&) const gsiMethodsVar.h:1077
    #14 0x108c18348 in rba::method_adaptor(int, int, unsigned long*, unsigned long, bool) rba.cc:1419
    #15 0x108c1c824 in rba::method_adaptor_n(int, int, unsigned long*, unsigned long, bool) rba.cc:1464
    #16 0x108c1dba4 in unsigned long rba::method_adaptor<77>(int, unsigned long*, unsigned long) rba.cc:1470
    #17 0x105fe8298 in vm_call_cfunc_with_frame_+0xe8 (libruby.3.3.dylib:arm64+0x1d4298)
    #18 0x105fd2350 in vm_exec_core+0x17dc (libruby.3.3.dylib:arm64+0x1be350)
    #19 0x105fcfbd0 in rb_vm_exec+0x268 (libruby.3.3.dylib:arm64+0x1bbbd0)
    #20 0x105fdcf0c in specific_eval+0x12c (libruby.3.3.dylib:arm64+0x1c8f0c)
    #21 0x105fe8298 in vm_call_cfunc_with_frame_+0xe8 (libruby.3.3.dylib:arm64+0x1d4298)
    #22 0x105fd25c4 in vm_exec_core+0x1a50 (libruby.3.3.dylib:arm64+0x1be5c4)
    #23 0x105fcfb20 in rb_vm_exec+0x1b8 (libruby.3.3.dylib:arm64+0x1bbb20)
    #24 0x105fdafb4 in rb_funcallv_scope+0x208 (libruby.3.3.dylib:arm64+0x1c6fb4)
    #25 0x108d1df50 in rba::rb_funcall2_wrap(unsigned long) rbaUtils.cc:425
    #26 0x105e98cdc in rb_protect+0xd4 (libruby.3.3.dylib:arm64+0x84cdc)
    #27 0x108d1dad4 in rba::rba_funcall2_checked(unsigned long, unsigned long, int, unsigned long*) rbaUtils.cc:454
    #28 0x108d288c0 in rba::Proxy::call(int, gsi::SerialArgs&, gsi::SerialArgs&) const rbaInternal.cc:256
    #29 0x107268624 in gsi::Callback::call_int(gsi::SerialArgs&, gsi::SerialArgs&) const gsiCallback.h:70

SUMMARY: AddressSanitizer: heap-use-after-free dbBoxTree.h:239 in db::box_tree_node<db::unstable_box_tree<db::box<int, int>, db::array<db::CellInst, db::simple_trans<int>>, db::box_convert<db::array<db::CellInst, db::simple_trans<int>>, false>, 100ul, 100ul, 4u>>::lenq(int) const

@stefanottili

wsl ubuntu: Linux version 5.15.153.1-microsoft-standard-WSL2

Neither gcc nor clang shows any valgrind issues with rule pSD.d

gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
Ubuntu clang version 14.0.0-1ubuntu1.1

@Kazzz-S
Contributor

Kazzz-S commented Oct 22, 2024

> @Kazzz-S 10 runs might not be enough, I had 2 fails in the first 10, then none in the next 30 before a couple in the next 10.

@stefanottili, you are right!
I wrote a script and ran the DRC 500 (= 5 DMGs x 100) times, which resulted in 4 crashes.
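To see why 10 runs can easily come out clean, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that runs are independent and that every build crashes at the same rate of 4 in 500 observed above; neither assumption is confirmed by the data, and `prob_no_crash` is a hypothetical helper.

```ruby
# Probability of seeing zero crashes in `runs` independent runs,
# given a per-run crash probability `crash_rate`.
def prob_no_crash(crash_rate, runs)
  (1.0 - crash_rate)**runs
end

rate = 4.0 / 500   # assumed per-run crash rate, taken from 4 crashes in 500 runs

[10, 100, 500].each do |runs|
  printf("%4d runs: P(no crash) = %.3f\n", runs, prob_no_crash(rate, runs))
end
```

Under these assumptions a batch of 10 runs stays crash-free roughly 92% of the time, so a clean batch of 10 says little about whether the bug is present.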

Kazzz-S

repeat100-mac.zip


MacBookPro2{sekigawa} ~ (1)% clang++ --version
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: x86_64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

@smunaut
Author

smunaut commented Oct 22, 2024

@Kazzz-S You should also check the number of DRC errors reported.

@klayoutmatthias
Collaborator

Good news :)

I built with the clang++ 18.1.3 that comes with my Ubuntu 24 instead of gcc, and although I don't see a crash or wrong DRC output, I can see the issue in the valgrind logs!

It could also be an issue of the STL used (clang is using their own STL implementation as far as I understand).

I am able to debug the issue now and will let you know my findings.

Thanks,

Matthias

@smunaut
Author

smunaut commented Oct 23, 2024

@klayoutmatthias Awesome ! Thanks a lot for looking into this.

@klayoutmatthias
Collaborator

Of course I do! :) Just allow me a little while ...

@Kazzz-S
Contributor

Kazzz-S commented Oct 23, 2024

> @Kazzz-S You should also check the number of DRC errors reported.

@smunaut, I modified the batch script batch-run.py to filter out any non-zero DRC errors, as shown below, with dummy data.

> Rule OffGrid.RFMEM: 1 error(s)
> Rule OffGrid.Recog_esd: 2 error(s)
> Rule OffGrid.DigiSub: 3 error(s)
> Rule OffGrid.dfpad_pillar: 4 error(s)
> Rule OffGrid.DeepVia: 5 error(s)
> Rule OffGrid.PolyRes: 6 error(s)
> Number of DRC errors: 21
Extracted Total Error(s)=21

Then I again ran the DRC 500 (= 5 DMGs x 100) times, but this time, I only had one crash at the beginning of program execution. No DRC errors were observed. See repeat100-2.log for details.
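The kind of filtering described above can be sketched as follows. This is not the actual batch-run.py (which is not shown in this thread); it is a hypothetical Ruby equivalent, and the regular expression is inferred only from the "Rule X: N error(s)" lines quoted in this issue.

```ruby
# Hypothetical pattern, inferred from the sample log lines in this thread.
RULE_LINE = /Rule\s+(\S+):\s+(\d+)\s+error\(s\)/

# Returns a hash of rule name => error count for rules with at least one error.
def nonzero_rule_errors(log_text)
  log_text.each_line.each_with_object({}) do |line, counts|
    match = RULE_LINE.match(line)
    next unless match
    count = match[2].to_i
    counts[match[1]] = count if count > 0
  end
end

sample = <<~LOG
  Rule pSD.c: 0 error(s)
  Rule OffGrid.RFMEM: 1 error(s)
  Rule OffGrid.Recog_esd: 2 error(s)
LOG

errors = nonzero_rule_errors(sample)
total = errors.values.sum
puts "Extracted Total Error(s)=#{total}"
```

Recording the per-rule counts rather than only the total makes it easier to spot whether spurious errors always come from the same rule.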

@klayoutmatthias, thanks for your effort 😄
Please let me know if I can help you using my test environment.

repeat100-2-mac.zip


[Added on 2024-10-25]

@klayoutmatthias
Collaborator

klayoutmatthias commented Oct 24, 2024

Here is an update: the issue seems to be caused by an interaction between Ruby's garbage collector and the explicit error filtering loop used in the DRC deck. I assume that some temporary layers are cleaned up during the iteration of the error shapes. Although that is technically unrelated, there are interactions because both the error shapes and the temporary layer shapes are stored in the same hierarchical structure.

That is not related to gcc vs. clang, but the compiler may make subtle differences.

A first patch is to disable the garbage collector during these loops.

I changed this function

class DRC::DRCEngine
    def find_intersecting_edges_errors(dbu_value,
                                       error_edge_pairs_90,
                                       error_edge_pairs_180,
                                       inverse_error_edge_pairs_90 = nil,
                                       inverse_error_edge_pairs_180 = nil,
                                       options = {})
        # ... (function body) ...
    end
end

(from line 135 onward in the .lydrc file)

to

class DRC::DRCEngine
    def find_intersecting_edges_errors(dbu_value,
                                       error_edge_pairs_90,
                                       error_edge_pairs_180,
                                       inverse_error_edge_pairs_90 = nil,
                                       inverse_error_edge_pairs_180 = nil,
                                       options = {})
        begin
            gc_was_disabled = GC.disable
            # ... (original function body) ...
        ensure
            gc_was_disabled || GC.enable
        end
    end
end

so that during the runtime of the function the GC is disabled.
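The same pattern can be factored into a reusable wrapper. The following is a minimal sketch, assuming plain Ruby; `with_gc_disabled` is a hypothetical helper and not part of KLayout or the iHP deck. It relies on `GC.disable` returning true when the GC was already disabled, so nested uses do not re-enable the GC too early.

```ruby
# Hypothetical helper: runs the given block with Ruby's GC switched off and
# restores the previous GC state afterwards, even if the block raises.
def with_gc_disabled
  # GC.disable returns true if the GC was already disabled before this call.
  was_disabled = GC.disable
  yield
ensure
  # Re-enable only if this call was the one that switched the GC off.
  GC.enable unless was_disabled
end

# Usage: the block's value is passed through unchanged.
result = with_gc_disabled { [1, 2, 3].map { |x| x * 2 } }
puts result.inspect
```

Keeping the GC off for long loops trades memory for stability, so it is best limited to the section that iterates over the error shapes.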

With this patch, I do not see issues in valgrind any longer.

A better solution is to disable the layout updates that cause the interference during the iteration. That is a C++ patch and should prevent similar issues in the future.

Matthias

@smunaut
Author

smunaut commented Oct 24, 2024

Hi @klayoutmatthias

I just tested the lydrc workaround and it worked fine. I got through a couple hundred runs without spurious DRC errors or crashes :)

Thanks ! I'll use that for this tapeout while waiting for a release with the C++ patch.

Sylvain

@stefanottili

The bugfix/issue-1907 branch fixes the asan issues on Mac M1.

@Kazzz-S
Contributor

Kazzz-S commented Oct 25, 2024

Hello @klayoutmatthias,

Just for your information.
I have

  • rebuilt the 5 DMGs for 0.29.8.
  • modified sg13g2_maximal.lydrc as per your suggestion.

Then, I observed
Zero crashes out of 1500 (= 5 x 300) runs.

MacBookPro2{sekigawa} drc_tst (1)% grep "Extracted Total Error(s)=0" repeat300.log | wc -l
    1500

Kazzz-S

@klayoutmatthias
Collaborator

@Kazzz-S and @stefanottili Many thanks for this feedback!

The patch is not perfect yet: I am observing some thread collision issues in one unit test that I still need to solve. But I am confident that is not a big issue.

Best regards,

Matthias

@klayoutmatthias klayoutmatthias linked a pull request Oct 26, 2024 that will close this issue