Batched job processing (opt-in) #474
base: main
Conversation
Things to check that I actually implemented (from my notes):
Also:
To enable feedback on this PR, I've released it as a canary. (Also, judging by the sporadic CI failures, there's clearly some kind of bug/timing issue somewhere.)
Hey, after some basic testing I decided to just wing it and deployed the Graphile Worker canary to prod while closely monitoring. We are mainly exporting jobs and they are deduped, so not a lot can go wrong. The results are really awesome: improvements pretty much across the board.
Current settings (chosen somewhat randomly, really):

```ts
preset: {
  extends: [WorkerProPreset],
  worker: {
    connectionString: DB_CONNECTION_STRING,
    schema: 'graphile_worker',
    sweepThreshold: 600000, // 10 minutes
    concurrentJobs: 24,
    // pollInterval is only relevant for scheduled jobs (after now()) and retries
    pollInterval: 120_000,
    maxPoolSize: 24,
    localQueueSize: 24,
    completeJobBatchDelay: 100, // milliseconds
    failJobBatchDelay: 100, // milliseconds
  },
},
```
@benjie After two weeks we just had our first crash. Here are the logs:
Thanks for the report of the crash! There's clearly an issue preventing the tests from reliably passing too; I suspect some kind of race condition has slipped in. Did these errors not have stack traces? If they have stack traces, that would significantly help in tracking down the issue!
Given the error is coming from here (lines 281 to 282 in 436e299), could you just coalesce the `entryResult.reason` a few lines up? Although this doesn't actually help find out the real reason the promise rejected. Is it possible your task executor is calling `reject(null)` on the promise?
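For clarity, the suggested coalesce might look something like the sketch below (illustrative only; `entryResult` is assumed to be a `PromiseSettledResult` from the worker internals referenced above, and the fallback message is made up):

```ts
// Illustrative sketch: ensure a rejection reason is never null/undefined
// before it is re-thrown or logged.
if (entryResult.status === "rejected") {
  const reason =
    entryResult.reason ??
    new Error("Task promise rejected with a nullish reason");
  // ...handle `reason` as the job's failure cause...
}
```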
@hillac I've fixed that line by adding a coalesce. How did you determine that was the correct location? We still seem to have an issue preventing CI from passing reliably. A new release contains the fix.
@psteinroe's stack trace has the line number: worker.js:222:43. You can look in the node_modules path the stack trace shows to see the JS code.
Hooray; it seems that the instability in this PR was actually in the test framework rather than Worker itself. Four passes in a row suggest I have stabilized this now; plus the tests are now faster (locally, not in CI) because they run in parallel, and they're independent of each other since they run on clean databases.
Latest canary released: I've extracted the following PRs from this PR so that it's simpler to review: |
The latest canary is out and it contains a number of fixes to other parts of Worker. |
Description
Replaces #99 and #470.
If you're at very high scale (e.g. you're running multiple Worker instances, and each instance has high concurrency) then the act of looking for and releasing jobs can start to dominate the load on the database. This PR gives the ability to configure Graphile Worker such that `getJob` (via `localQueueSize`), `completeJob` (via `completeJobBatchDelay`) and `failJob` (via `failJobBatchDelay`) can be batched, thereby reducing this database load (and improving job throughput). This is an opt-in feature, via the following settings (see the sketch after this list):

- With `localQueueSize >= 1`, pools become responsible for getting jobs: each pool will grab the number of jobs that you specify up front and distribute these to its workers on demand. This is done via a "Local Queue".
- With `completeJobBatchDelay >= 0` or `failJobBatchDelay >= 0`, pools also become responsible for completing or failing jobs (respectively); they will wait the specified number of milliseconds after a `completeJob` or `failJob` call and batch any other calls made in the interim; all of these results are sent to the database at the same time, reducing the total number of transactions.

Note that enabling these features changes the behavior of Worker in a few ways.
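For illustration, a minimal preset enabling all three batching settings might look like the following. This is a sketch only: the option values are arbitrary, and the connection string is a placeholder.

```ts
// graphile.config.ts — a minimal sketch enabling the opt-in batching
// features from this PR. Values are illustrative, not recommendations.
import type {} from "graphile-config";

const preset: GraphileConfig.Preset = {
  worker: {
    connectionString: process.env.DATABASE_URL,
    concurrentJobs: 10,
    // The pool fetches up to 100 jobs at once and hands them out on demand:
    localQueueSize: 100,
    // Completions/failures within the same 10ms window are sent together:
    completeJobBatchDelay: 10, // milliseconds
    failJobBatchDelay: 10, // milliseconds
  },
};

export default preset;
```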
Performance impact
If these features are not enabled, the impact is minimal.
If enabled, throughput improves at the cost of potentially increased latency.
The following results were produced with the CPUs set to the `performance` governor.

Base performance:
Jobs per second: 16093.94

With `localQueueSize: 500`:
Jobs per second: 35177.47

With `localQueueSize: 500, completeJobBatchDelay: 0, failJobBatchDelay: 0` (note: even though the delays are `0`, this still enables batching; it is just limited to (roughly) a single JS event loop tick):
Jobs per second: 180684.70
Note that the workload benchmarked here is designed to put maximal stress on the database (i.e. the tasks are basically no-ops); YMMV with real-world loads.
The CPUs were configured with this script:
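The script itself isn't reproduced in this thread; as a rough sketch of the idea, a Node script like the following (assuming the Linux cpufreq sysfs interface, run as root) would pin every CPU to the `performance` governor:

```ts
// set-performance-governor.ts — a hypothetical stand-in for the script
// mentioned above (the original isn't shown here). Writes "performance"
// to each CPU's cpufreq governor via Linux sysfs; must be run as root.
import * as fs from "node:fs";
import * as path from "node:path";

const cpuRoot = "/sys/devices/system/cpu";
for (const entry of fs.readdirSync(cpuRoot)) {
  if (!/^cpu\d+$/.test(entry)) continue; // skip cpufreq/, cpuidle/, etc.
  const governorFile = path.join(cpuRoot, entry, "cpufreq", "scaling_governor");
  if (fs.existsSync(governorFile)) {
    fs.writeFileSync(governorFile, "performance");
  }
}
```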
Security impact
Not known.
Checklist

- `yarn lint:fix` passes.
- `yarn test` passes.
- I've updated the `RELEASE_NOTES.md` file (if one exists).
- If this is a breaking change I've explained why.

Fixes #501