
Something seems to go wrong with the groups feature when the server crashes #44

Open
for2gles opened this issue Mar 22, 2023 · 14 comments

Comments

@for2gles

for2gles commented Mar 22, 2023

Hi

I set
group concurrency: 1
global concurrency: 15
While one job was being processed, the server got restarted for some reason. The job seems to go back to waiting, but it never becomes active again. Even if I delete the stalled job and add it again with the same group id, the job doesn't run.

It seems that specific group id gets stuck and doesn't return to normal when the server crashes during processing.
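(For reference, a minimal sketch of the kind of setup described above; the queue name, job data and group id are placeholders, and it assumes BullMQ Pro's group options on the job and worker.)

```ts
// Sketch only: queue name, data and group id are hypothetical.
import { QueuePro, WorkerPro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 };

const queue = new QueuePro('grouped-queue', { connection });

// Jobs are added to a group; the worker processes at most 1 job per group
// at a time, and at most 15 jobs in total (global concurrency).
const worker = new WorkerPro(
  'grouped-queue',
  async (job) => {
    // ... process the job
  },
  {
    connection,
    concurrency: 15,           // global concurrency
    group: { concurrency: 1 }, // per-group concurrency
  },
);

async function enqueue() {
  await queue.add('task', { payload: 42 }, { group: { id: 'group-a' } });
}

enqueue().catch(console.error);
```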

@for2gles
Author

When I check the group status, it returns maxed, but it's not processing at all.

@manast
Contributor

manast commented Mar 22, 2023

Which version of BullMQ Pro are you using?

@manast
Contributor

manast commented Mar 22, 2023

So to summarize your issue:

  1. A job belonging to a group G was being processed.
  2. The server was restarted while the job was being processed.
  3. The job was correctly moved back to wait, but the group is still in "maxed" status.
  4. No other job is being processed (active) in that group.

Can you confirm?

@for2gles
Author

Yes, that's right.
Let me explain the situation in detail:

  1. I'm using BullMQ Pro version 5.1.14.
  2. I'm also using standard BullMQ version 1.76.6 alongside it, for a legacy queue.
  3. So I also use QueueScheduler, and I still create a QueueScheduler for the old queue (I don't know whether that has any effect).
  4. The Node server runs in Docker and is restarted by CI/CD.
  5. The weird thing is that retrying does work when I run it locally and kill the Node pid (I don't use Docker locally): it keeps processing the next jobs properly. (I used a local Redis for that test.)

Please ask me anything if you need more info.
BTW it's almost 8 PM here, so I may reply tomorrow.
thank you 🙏

@manast
Contributor

manast commented Mar 22, 2023

Ok, an explanation for this behavior could be that the standard BullMQ (not Pro) is actually also using the new queues. For example, if a Pro worker crashes or is re-started, the standard BullMQ could move that job to wait, but since it does not know about groups, the group will stay at "maxed".
Also I wonder, why not upgrade to the latest BullMQ, or even better use BullMQ Pro for all queues? (With the newest version you do not need the QueueScheduler either, so it is easier.)
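(For context, a rough sketch of the legacy pattern referred to here, assuming BullMQ v1.x, where a QueueScheduler instance had to be created alongside each queue to handle delayed and stalled jobs; the queue name is hypothetical. Newer BullMQ versions no longer need this class.)

```ts
// Legacy BullMQ v1.x pattern: a QueueScheduler per queue (not needed in newer versions).
import { Queue, QueueScheduler, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

const legacyQueue = new Queue('legacy-queue', { connection });
const legacyScheduler = new QueueScheduler('legacy-queue', { connection });

const legacyWorker = new Worker(
  'legacy-queue',
  async (job) => {
    // ... legacy processing
  },
  { connection },
);
```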

@manast
Contributor

manast commented Mar 22, 2023

Another thing. By any chance, do you share a Redis connection between the standard and the Pro version?

@for2gles
Author

for2gles commented Mar 23, 2023

Ok, an explanation for this behavior could be that the standard BullMQ (not Pro) is actually also using the new queues. For example, if a Pro worker crashes or is re-started, the standard BullMQ could move that job to wait, but since it does not know about groups, the group will stay at "maxed".

I don't think so, because I completely separate the standard BullMQ queues/workers from the BullMQ Pro queues/workers.
So I don't think it's possible that any of these jobs were created by a standard BullMQ queue.

Also I wonder, why not upgrade to the latest BullMQ, or even better use BullMQ Pro for all queues? (With the newest version you do not need the QueueScheduler either, so it is easier.)

It's just that I don't want to change something that is already working well.

Another thing. By any chance, do you share a Redis connection between the standard and the Pro version?

No, I use separate connections. I don't know why, but I couldn't even start the Node server when I used the same connection.
This is my comment
BTW, do you know why it's not possible to use the same connection?

@manast
Contributor

manast commented Mar 23, 2023

BTW, do you know why it's not possible to use the same connection?

Because when BullMQ starts, it loads a bunch of Lua scripts with that connection, and I think that if you use two different versions with the same connection, the scripts get mixed up.

@manast
Contributor

manast commented Mar 23, 2023

By connection I mean an IORedis instance, btw.
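(A minimal sketch of keeping the two versions on separate connections, as discussed above: each library gets its own IORedis instance, so each loads its own Lua scripts independently. Queue names are hypothetical.)

```ts
import IORedis from 'ioredis';
import { Queue } from 'bullmq';                      // standard BullMQ (legacy queues)
import { QueuePro } from '@taskforcesh/bullmq-pro';  // BullMQ Pro (grouped queues)

// One dedicated IORedis instance per library version.
const standardConnection = new IORedis();
const proConnection = new IORedis();

const legacyQueue = new Queue('legacy-queue', { connection: standardConnection });
const groupedQueue = new QueuePro('grouped-queue', { connection: proConnection });
```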

@manast
Contributor

manast commented Mar 23, 2023

Also, I am releasing a "repairMaxedGroup" function to Pro and exposing it in Taskforce.sh so that you can fix the maxed groups manually. This should never happen, but at least if it happens now you can do something about it. We will need to investigate it further to discover the cause behind it.

@manast
Contributor

manast commented Mar 23, 2023

It is released now, please give it a try:


@for2gles
Author

for2gles commented Mar 24, 2023

Oh, thank you. It works!
BTW, is the function a heavy operation?

I just made a worker that runs every 10 minutes to check every BullMQ Pro queue, like below:

  1. Get the groups list for each BullMQ Pro queue
  2. Filter only the groups whose status is maxed
  3. Run the function that you released
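(A rough sketch of that periodic check. repairMaxedGroup is the function mentioned above; its exact signature on QueuePro, and the getGroups() call, are assumptions here rather than confirmed API, so check the Pro docs/typings.)

```ts
import { QueuePro } from '@taskforcesh/bullmq-pro';

// Hypothetical list of the Pro queues to check.
const allProQueues: QueuePro[] = [/* ... */];

async function repairMaxedGroups(queues: QueuePro[]) {
  for (const queue of queues) {
    // 1. Get the groups list for this queue.
    // NOTE: getGroups() and repairMaxedGroup() are assumed names; the casts to
    // `any` are only because this is an unchecked sketch -- use the real typings.
    const groups: Array<{ id: string; status: string }> = await (queue as any).getGroups();

    // 2. Keep only the groups whose status is "maxed".
    const maxed = groups.filter((group) => group.status === 'maxed');

    // 3. Run the repair function for each maxed group (signature assumed).
    for (const group of maxed) {
      await (queue as any).repairMaxedGroup(group.id);
    }
  }
}

// Run every 10 minutes.
setInterval(() => repairMaxedGroups(allProQueues).catch(console.error), 10 * 60 * 1000);
```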

@manast
Contributor

manast commented Mar 24, 2023

The function is not designed to be used frequently, as this issue should never happen :) If you are able to reproduce this issue frequently, then please provide some code that reproduces it and we will fix it instead.

@for2gles
Author

OK, thank you.
