Runners failing due to "failed to create new OS thread" #330

Open

sb10 opened this issue Jul 2, 2020 · 1 comment

sb10 commented Jul 2, 2020

rm -fr ~/.wr_development out && /tmp/wr manager start -s local --deployment development --debug -f 2> out

[from second shell]
perl -e 'for(1..2000){print "echo $_\n"}' | /tmp/wr add -i 30114-qc_genotype --cpus 0 -m 20M -o 2 -r 0 --cwd /tmp --cwd_matters --deployment development && sleep 15 && echo "getting status...\n" && /tmp/wr status -i 30114 -z -o c --deployment development; grep -c "completed job" ~/.wr_development/log; grep -c "failed to" ~/.wr_development/log

It bombs out with:

runtime: failed to create new OS thread (have 1306 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

ulimit -u
1828079
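
For the record, errno=11 is EAGAIN. RLIMIT_NPROC (what `ulimit -u` reports) counts every thread owned by the user across all processes, and clone(2) can also return EAGAIN because of the system-wide kernel.threads-max or a cgroup pids limit, neither of which `ulimit -u` reflects. A standalone diagnostic sketch (not part of wr; assumes golang.org/x/sys/unix) that prints the relevant limits:

```go
// Diagnostic sketch: print the limits that can make clone(2) return EAGAIN.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// RLIMIT_NPROC counts all threads/processes owned by the user, not
	// just this process, so other processes on the node share the budget.
	var rl unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NPROC, &rl); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		os.Exit(1)
	}
	fmt.Printf("RLIMIT_NPROC soft=%d hard=%d\n", rl.Cur, rl.Max)

	// EAGAIN can also come from the system-wide thread limit or from a
	// cgroup pids limit (cgroup v1 path shown; v2 differs).
	if b, err := os.ReadFile("/proc/sys/kernel/threads-max"); err == nil {
		fmt.Printf("kernel.threads-max=%s", b)
	}
	if b, err := os.ReadFile("/sys/fs/cgroup/pids/pids.max"); err == nil {
		fmt.Printf("cgroup pids.max=%s", b)
	}
}
```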

Not something I can really fix by raising the limit (it's already huge), so instead I set 0-core jobs to run at most double the core count.
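
The change is essentially a one-line cap; a hedged sketch of the idea (the name and placement are illustrative, not wr's actual scheduler code):

```go
// Illustrative placement only, not wr's actual package layout.
package scheduler

import "runtime"

// maxZeroCoreRunners limits how many 0-cpu jobs run concurrently:
// at most double the machine's core count, instead of unlimited.
func maxZeroCoreRunners() int {
	return runtime.NumCPU() * 2
}
```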

Couldn't replicate any problems with this change. After reverting back to unlimited clients:

git checkout 27d9335

I also couldn't replicate the problem on an m2.3xlarge after 10 attempts, and likewise on an s2.3xlarge and an o2.3xlarge (which needs more time to work).
Reconfirmed the problem on an m1.3xlarge. But that flavour has 54 cores while all the others have 30, so I needed to try an m1.2xlarge (26 cores) and m2/s2/o2.4xlarge flavours (60 cores): it happened on an m2.4xlarge and did not happen on the m1.2xlarge. So the failure seems to correlate with high core count (and hence many simultaneous runners) rather than with the flavour family.

Even when it works, the fork issues in the runners mean that jobs get delayed and it takes a lot longer for everything to complete than on a 30-cpu machine. So limiting the number of runners is a legitimate thing to do.

Out of curiosity though, is it a Go runtime bug, or an Ubuntu bug? Trying with other operating systems:

wr cloud deploy -f m2.4xlarge -m 1 -o cirros-0.3.5-x86_64-disk.img -u cirros
EROR[08-21|14:58:06] failed to launch a server in openstack: cloud server never became ready to use
[the server can be ssh'd to, but there is nothing in /tmp]

wr cloud deploy -f m2.4xlarge -m 1 -o CentOS-7-2019-01-28 -u centos
runtime: failed to create new OS thread (have 363 already; errno=11)

wr cloud deploy -f m2.4xlarge -m 1 -o debian-9.8.2-20190303 -u debian
The status command failed with:
runtime: failed to create new OS thread (have 9 already; errno=11)
...and then it worked on the second attempt. Another status command failed with:
runtime: failed to create new OS thread (have 2 already; errno=11)
[...]
-bash: fork: retry: Resource temporarily unavailable
Did get the manager to fail with:
runtime: failed to create new OS thread (have 990 already; errno=11)

So it is not an Ubuntu-specific problem: the same failure reproduces on CentOS and Debian.

Worth looking at this again to see if there's any way to deal with these issues.
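
One avenue to consider (speculative, not wr's current behaviour; startWithRetry is a hypothetical helper): when the failure surfaces as EAGAIN while forking/execing a runner process, retry with backoff so transient thread/process pressure can drop. This can't help with the fatal "failed to create new OS thread" inside an already-running Go program, which the runtime cannot recover from:

```go
// Speculative mitigation sketch; illustrative package name.
package runner

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// startWithRetry starts a command, retrying with backoff when fork/exec
// fails with EAGAIN ("Resource temporarily unavailable").
func startWithRetry(name string, args ...string) (*exec.Cmd, error) {
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		cmd := exec.Command(name, args...)
		if err = cmd.Start(); err == nil {
			return cmd, nil
		}
		if !errors.Is(err, syscall.EAGAIN) {
			return nil, err // a different failure; don't retry
		}
		// Wait for thread/process pressure to drop, e.g. for other
		// runners to exit, before trying again.
		time.Sleep(time.Duration(attempt+1) * 500 * time.Millisecond)
	}
	return nil, fmt.Errorf("start %s: %w", name, err)
}
```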

sb10 added the bug label Jul 2, 2020

sb10 commented Jul 2, 2020

These tests pretty reliably reproduce the scheduler locking up and the manager dropping dead with no logged errors, despite adding the guard around the use of the reserve channel (a sketch of that kind of guard follows the commands below).

wr cloud deploy -f m1.3xlarge -m 1
[ssh there]
/tmp/wr manager stop --deployment development
make && scp -i /nfs/users/nfs_s/sb10/.wr_development/cloud_resources.openstack.key /nfs/users/nfs_s/sb10/go/bin/wr [email protected]:/tmp/wr

rm -fr ~/.wr_development && /tmp/wr manager start -s local --deployment development --debug && perl -e 'for(1..2000){print "echo $_\n"}' | /tmp/wr add -i 30114-qc_genotype --cpus 0 -m 20M -o 2 -r 0 --cwd /tmp --cwd_matters --deployment development && sleep 63 && echo "getting status...\n" && /tmp/wr status -i 30114 -z -o c --deployment development
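
For reference, the guard mentioned above would be something along these lines; this is only a sketch of the pattern, and wr's actual guard and channel types may differ:

```go
// Illustrative placement only.
package jobqueue

import "sync"

// reserveGuard wraps a reserve-style channel so that a late sender can
// neither panic on a closed channel nor block while holding the lock.
type reserveGuard struct {
	mu     sync.Mutex
	ch     chan struct{}
	closed bool
}

// send signals the channel if it is still open and a receiver is ready,
// and reports whether the signal was delivered.
func (g *reserveGuard) send() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false // already shut down; drop the signal
	}
	select {
	case g.ch <- struct{}{}:
		return true
	default:
		return false // no receiver ready; avoid blocking under the lock
	}
}

// shutdown closes the channel exactly once.
func (g *reserveGuard) shutdown() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.closed {
		g.closed = true
		close(g.ch)
	}
}
```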

Trying again after raising the ulimit. That seemed to help a bit, but I still get errors and manager deaths.
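
The ulimit raise could also be attempted from inside the manager at startup rather than in the shell; a minimal sketch, assuming golang.org/x/sys/unix, that lifts the soft RLIMIT_NPROC up to the hard limit (only a privileged process could raise the hard limit itself):

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// raiseNprocSoftLimit lifts the soft RLIMIT_NPROC to the hard limit.
func raiseNprocSoftLimit() error {
	var rl unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NPROC, &rl); err != nil {
		return err
	}
	if rl.Cur < rl.Max {
		rl.Cur = rl.Max // unprivileged processes can't go beyond the hard limit
		return unix.Setrlimit(unix.RLIMIT_NPROC, &rl)
	}
	return nil
}

func main() {
	if err := raiseNprocSoftLimit(); err != nil {
		log.Fatal(err)
	}
	log.Println("RLIMIT_NPROC soft limit raised to hard limit")
}
```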

syslog shows no problems until around the point everything completes; then a bunch of runners start and exit because the queue is empty, and then there's a whole bunch of "receive time out" and "jobqueue Connect(): could not reach the server" errors starting at 13:30:37.
