High io_wait CPU usage #247
Comments
Probably async-profiler can help there, but first, a quick check:
This change b047cec#diff-e660183916ec399e605aac22a0864bd4b02ee986c85f5d96256872b5cf7f6783R49 should have exactly the opposite effect of what you're seeing, and it should be within …
I've also double-checked and it turned out that I've been using io_uring v0.0.24. I switched to 0.0.25 and it didn't improve io_wait times. Surprisingly, I see the same picture even when the service is completely idle - no traffic is routed to it and no background tasks are running. Even in this case the io_wait time is above 90%.
Ok, so I found the reason. NIO and epoll can use zero-copy file transfer while io_uring cannot (search for the same issue in this repo).
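For illustration, a minimal sketch of what that difference looks like on the Netty side, assuming a handler that writes a file to a channel; DefaultFileRegion and ChunkedNioFile are real Netty classes, the surrounding helper and the zeroCopy flag are hypothetical:

```java
import java.io.File;
import java.io.RandomAccessFile;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.DefaultFileRegion;
import io.netty.handler.stream.ChunkedNioFile;

// Hypothetical helper showing the two file-write paths.
final class FileWriteSketch {
    static void writeFile(ChannelHandlerContext ctx, File file, boolean zeroCopy) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        long length = raf.length();
        if (zeroCopy) {
            // Transports that support FileRegion (NIO, epoll) can hand this off
            // to sendfile(2), so the file bytes never enter user space.
            ctx.writeAndFlush(new DefaultFileRegion(raf.getChannel(), 0, length));
        } else {
            // Otherwise the file is read into buffers and written out in chunks
            // (requires a ChunkedWriteHandler in the pipeline), which costs extra
            // copies and extra read syscalls.
            ctx.writeAndFlush(new ChunkedNioFile(raf.getChannel(), 0, length, 8192));
        }
    }
}
```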
I would profile with async-profiler or some perf top -a.
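If attaching from the command line is inconvenient, async-profiler's Java API can start a wall-clock profile in-process. A minimal sketch, assuming libasyncProfiler.so is on java.library.path; the duration and output path are arbitrary:

```java
import java.io.IOException;

import one.profiler.AsyncProfiler;  // API class shipped with async-profiler

public final class WallClockProfile {
    public static void main(String[] args) throws IOException, InterruptedException {
        AsyncProfiler profiler = AsyncProfiler.getInstance();
        // Wall-clock mode samples threads whether they are running or blocked,
        // which is what matters when the symptom is io_wait rather than CPU time.
        profiler.execute("start,event=wall");
        Thread.sleep(60_000);
        // The resulting flame graph shows where the event-loop threads spend their time.
        profiler.execute("stop,file=/tmp/wall-profile.html");
    }
}
```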
Yep, it seems that both are waiting, meaning there is not enough network activity to perform further I/O.
Thanks for all of the suggestions and answers! As for the load - one that epoll normally handles fine - does the absence of zero-copy file transfer explain the timeouts, both in the service itself and even in the http metrics endpoints? Or could it explain them, at least theoretically? Does it mean that io_uring performance gains are not realized when working with files (reading / writing)?
When switching an application to use io_uring as the transport, we observed a huge increase in CPU io_wait time, going from 0.01% to ~95% for the same load. Overall CPU usage, however, stays low, around 4% per core, as you can see in the charts below.
The app was running on a relatively fresh OS and kernel version.
Is such an increase in io_wait expected? What is the best way to diagnose it further? Recording a JFR didn't give much; the Socket IO events are missing.
As for the side effects of switching to io_uring, the app started to see many more timeouts, both for incoming and outgoing http requests. For example, timeouts are observed both when serving regular requests and the metrics requests, even though the latter are supposed to be fast.
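For context, switching between the two transports is done roughly as in the sketch below. This is a minimal illustration, not the application's actual code; the io_uring class names come from the netty-incubator-transport-io_uring artifact, and the NIO fallback is an assumption:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.ServerChannel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.incubator.channel.uring.IOUring;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

// Minimal illustration of switching the server transport; handlers and binding omitted.
public final class TransportChoice {
    public static ServerBootstrap bootstrap(boolean preferIoUring) {
        boolean ioUring = preferIoUring && IOUring.isAvailable(); // needs a supporting kernel
        EventLoopGroup group = ioUring
                ? new IOUringEventLoopGroup()   // io_uring event loops
                : new NioEventLoopGroup();      // NIO fallback
        Class<? extends ServerChannel> channelType = ioUring
                ? IOUringServerSocketChannel.class
                : NioServerSocketChannel.class;
        return new ServerBootstrap().group(group).channel(channelType);
    }
}
```

Guarding on IOUring.isAvailable() lets the same build fall back to NIO on kernels without io_uring support, which also makes side-by-side comparisons like the charts below easier.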
IO_uring enabled: [chart]
IO_uring disabled: [chart]
JRE vendor/version: java-21-amazon-corretto
Exact version: 21.0.3+9-LTS
IO_uring version: 0.0.25.Final, x86-64
Netty version: 4.1.107-109
OS version: Ubuntu 22.04.4 LTS, at AWS EC2
Kernel version: 6.5.0-1018-aws