small-file performance again #4176
Comments
Do you know what file operation is affecting the performance?
We've disabled the performance monitor due to its negative impact, and we're trying to get every ounce of performance we can. I can definitely re-enable it for relatively short periods. I've unfortunately deleted the state dumps from prior to the remount. Going to let things run for a bit (the day) and then create new state dumps for you. Is there any way to track lock contention without major performance impact? Specifically I'm wondering about the brick processes, i.e., will enlarging the inode hash table at the brick layer have a similar impact to what we've seen in the fuse mounts?
What is the volume configuration? You can try perf trace -s -p <brick_pid>, I think it should not hurt the performance much.
Hi, I'll attach perf during a quieter time and re-trigger the process that generally takes too long. This should give us a reasonable idea. In the meantime, the gluster volume info details:
Mounts happen with:
Two nodes mount against localhost, and a handful of others mount against a ucarp address floating between the two servers.
Ok, so I've managed to ignore this for a while. iotop today clearly showed approaching 64 threads blocking on IO. Upon stracing one of the threads, quite a bit of this was fstatat() and readdir*() related calls. The load averages on the two hosts that are hosting the bricks are consistently quite high, whereas those that are merely running fuse mounts are at this stage relatively quiet. For this reason, again, I believe there is work to be done on the brick side. Looking at a state dump there definitely is a table:
This is also the ONLY table found in the brick process state dump. I've now proceeded to set network.inode-lru-limit to 1048576, which almost immediately caused lru_size to increase to ~700k entries and disk IO wait to drop (even though MB/s throughput increased). So that's likely a good thing, but I do think increasing the inode table size would also help, and likewise the value of dentry_hashsize (the fuse mount would also benefit from this), so I'll try to implement that too since, if I understand correctly, it should improve readdir and lookup performance. I believe the bulk of our workload is readdir and, to a degree, lookup, since these are the functions maildir is heavy on (it uses readdir to get a listing of emails in a specific folder). These don't seem quite right, but this is what's reported:
Specifically I don't get why there are relatively few lookup and readdirp calls (counter wrap?); either way, the thing that needs to be targeted here is primarily readdirp performance from the looks of it. Looking at the code, the way the options are applied on the server side differs quite a bit from the fuse side: where it's trivial to use options to inode_table_new() to vary the size, in the case of the server things get a bit more complicated, and either the table needs to be replaced with a new one, or a new bucket array needs to be allocated and, under lock, all existing inodes rehashed into it (a sketch of the latter follows below). Similar will need to happen if modifying dentry_hashsize on the fly. This should not be a frequent operation, or possibly one can rely on restarting the bricks to make the table size and dentry size options take effect on the bricks? Either way, which of the three strategies to follow bugs me a bit currently. Yes, I'm quite happy to sacrifice RAM to reduce disk IO and contention overall. A table size of 2^20 is still "only" 16MB of RAM, a very small price to pay indeed.
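As a rough illustration of the rehash-under-lock idea (hypothetical types and names, not the actual glusterfs inode table API):

```c
/*
 * Hypothetical sketch of the "allocate a new bucket array and rehash
 * under lock" strategy discussed above.  Names and fields are
 * illustrative only; the real glusterfs inode table differs.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct inode_entry {
    uint64_t gfid_hash;        /* precomputed hash of the inode's GFID */
    struct inode_entry *next;  /* bucket chain */
};

struct inode_table {
    pthread_mutex_t lock;
    size_t hashsize;           /* number of buckets, e.g. 1 << 20 */
    struct inode_entry **buckets;
};

/* Grow the bucket array; existing entries are rehashed while the table
 * lock is held, so lookups briefly block but never see a half-migrated
 * table. */
int inode_table_resize(struct inode_table *t, size_t new_size)
{
    struct inode_entry **nb = calloc(new_size, sizeof(*nb));
    if (!nb)
        return -1;

    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < t->hashsize; i++) {
        struct inode_entry *e = t->buckets[i];
        while (e) {
            struct inode_entry *next = e->next;
            size_t b = e->gfid_hash % new_size;
            e->next = nb[b];
            nb[b] = e;
            e = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->hashsize = new_size;
    pthread_mutex_unlock(&t->lock);
    return 0;
}
```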
To improve readdirp performance there is one more thing you can try. Instead
From what I've read of parallel-readdir it's only really helpful in distribute cases, which is why it's not enabled here. The 4KB vs 128KB (or even larger, say 256KB) buffer sounds potentially much more useful. And yes, at this point in time I'm happy to try more things, so please fire away with patches, even kernel patches if that's where the buffer sits?
No, the parallel approach would be helpful for all cases. We can wind a lookup call during readdir_cbk at the fuse xlator instead of waiting for the application/kernel to wind a lookup call. I need some time; before Monday I will share the patch.
Thanks. OK, I've now enabled performance.parallel-readdir again. Since I'm assuming this is a client side option I'm guessing I need to remount? Also just looked at the kernel code (fs/fuse/readdir.c) and it certainly looks like it's allocating a page at a time ... it would be tremendously helpful if this too could be increased to read "up to" 128KB at a time at the kernel level. I can certainly look into this, but the concept of a page at a time here seems somewhat wired into the way things are cached too, so I'm a little worried that this may be somewhat difficult. Assuming that it's a simple change in fuse_readdir_uncached() to modify the alloc_page() call to alloc_pages() with an order value of 5 (4KiB pages x (2^5==32) pages == 128KiB), would current glusterfs code handle that? (This I can test without affecting production quite trivially.) Even if you wind an additional readdir() call (which would basically result in a read-ahead for readdir, if I understand your strategy correctly? How does this compare to performance.readdir-ahead?) that would still benefit overall performance, even if reading then happens 32 pages at a time. In the meantime I'll work on a patch for increasing the dentry_hashsize on the fuse client as well (I'm not sure how/where this would help, to be perfectly honest; thinking through this again it's possibly just a name => inode cache, so primarily for lookup performance, which whilst problematic isn't the most problematic issue). Guidance on how to implement inode table size and dentry table sizes for the server/brick side would be appreciated. Please help my understanding if I go off the rails; I'm not intimately familiar with the glusterfs internals, but have managed to learn a fair bit over the years, though every time it feels like a steep curve all over again.
Either parallel-readdir or readdir-ahead is causing major problems, a stuff-blocking-badly-on-fuse kind of scenario. I've backed out performance.readdir-ahead for the time being to see if this resolves it.
Here I did not mean to use parallel-readdir; parallel-readdir we can use only in the case of dht along with readdir-ahead. I meant
The kernel patch that I submitted to increase the kernel readdir buffer via fuse from 4KB to 128KB is already showing promise. Instead of application readdir() getting 14 entries on average, it now gets just under 500 on every system call, which is a HUGE improvement already, but the overall system call time is definitely longer, so your patch will still be required.
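For what it's worth, the entries-per-syscall figure above can be measured without strace by issuing getdents64() directly; a small stand-alone sketch (my own, assuming Linux/glibc, not part of either patch):

```c
/* Count and time raw getdents64() calls on a directory, to see how many
 * entries each syscall returns (e.g. ~14 with a 4KB fuse buffer vs ~500
 * with 128KB).  Standalone illustration only. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[128 * 1024];                    /* match the patched fuse buffer */
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    for (int call = 1; ; call++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (n <= 0)                          /* 0 = end of directory, <0 = error */
            break;

        int entries = 0;
        for (long off = 0; off < n; ) {      /* walk the packed dirent records */
            struct dirent64 *d = (struct dirent64 *)(buf + off);
            entries++;
            off += d->d_reclen;
        }
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("call %d: %ld bytes, %d entries, %.1f ms\n", call, n, entries, ms);
    }
    close(fd);
    return 0;
}
```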
Can you please share the patch link, just for reference?
Release 11 has many (small) improvements around inode.c, the readdir(p) code and memory allocations (mainly locality changes and fewer calls to malloc/free). I wonder if for your use case it'll make a difference. Of course, the above change looks much more critical.
At this stage I'm inclined to think it can't get worse. If I upgrade the op-version, even if it implies downtime, is it possible to downgrade again?
There are multiple readdir improvements with this patch for fuse:
1) Set the buffer size to 128KB when winding a readdir call from fuse
2) Wind a lookup call for all entries in parallel from fuse_readdir_cbk
3) Keep the inode in the active list after taking an extra reference for 10m (600s) so that the next lookup is served by md-cache
4) Take the reference only while table->active size is less than 1M
Fixes: gluster#4176 Change-Id: Ic9c75799883d94ca473c230e6213cbf00b383684 Signed-off-by: Mohit Agrawal <[email protected]>
@jkroonza Please test the patch in your test environment first, then you can apply the patch in your production environment. To enable readdir you first have to disable readdirp; to disable readdirp you can apply the below setting on the volume:
gluster v set test dht.force-readdirp off
The volume needs to be mounted like this, passing two arguments (use-readdirp=no to disable readdirp and readdir-optimize=yes to enable readdir optimize):
mount -t glusterfs <ip_address>:/<vol_name> <mount_point> -o use-readdirp=no -o readdir-optimize=yes
@mohit84 why should readdirp be off? Surely the whole point of readdirp is to improve performance? Btw, with the kernel patch we're now getting IO queue lengths of >40 as measured with iostat -dmx, so it looks like just making that IO buffer between the kernel and fuse bigger has made a huge difference already. But the system call response times are still highly variable. Any changes required server/brick side? Current settings for those mentioned:
So they all need to change. Will quick-read make a difference in general anyway? What is the effect of quick-read? (I can't recall from memory now why we set it.)
We can use either readdir or readdirp to list a directory, not both. It depends on application usage (ls vs ll). If the application does a lookup after the readdir operation then readdirp or the patch (readdir+parallel_lookup) is useful, but if the application does not, plain readdir is cheaper (a small illustration of the difference follows below). There are certain cases when readdir+parallel_lookup outperforms readdirp, depending on what the application does with the entries. The quick-read option improves application performance if the application reads files frequently: it fetches the first 64k of a file during lookup.
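To make the "ls vs ll" distinction concrete, here is a minimal stand-alone illustration (my own sketch, not glusterfs code): the first loop needs only the directory entries, so plain readdir suffices, while the second stats every entry, which is exactly the per-entry lookup traffic that readdirp (or a lookup wound in parallel during readdir) tries to pre-fetch.

```c
/* "ls" style vs "ll" style directory access.  Only the second loop
 * generates a stat/lookup per entry, which is what readdirp pre-fetches. */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";

    /* "ls": names only -- plain readdir/getdents is all that is needed */
    DIR *dir = opendir(path);
    if (!dir) { perror("opendir"); return 1; }
    struct dirent *de;
    while ((de = readdir(dir)) != NULL)
        puts(de->d_name);
    rewinddir(dir);

    /* "ll": one fstatat() per entry -> one lookup per entry on the server,
     * unless the stat data was already returned/cached via readdirp */
    int dfd = dirfd(dir);
    while ((de = readdir(dir)) != NULL) {
        struct stat st;
        if (fstatat(dfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0)
            printf("%10lld %s\n", (long long)st.st_size, de->d_name);
    }
    closedir(dir);
    return 0;
}
```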
Thanks. I've disabled quick-read for now, as I think for the majority of our cases it's irrelevant. More and more I think that readdir is, for our use case, the single most important operation.
We've added 1TB of NVMe dm-cache (smq, writethrough) in front of the raid controller over the weekend, specifically on the gluster LV which is 8TB in size. Cache lines are 1MB in size. Cache has populated over the weekend and we're seeing >90% cache hit ratio on reads. I mention this because this change by itself has made a massive impact. The kernel patch I cooked last week got us to the point where the RAID controller became (from what we could see) the single largest bottleneck, which is what prompted the hardware change. And this only comes now since it took me MUCH longer than expected to get the below figures (had some issues upgrading the node in the DC I could use for testing without affecting the rest of the cluster). Performance stats: Directly on the brick:
So that's theoretically the fastest possible we can go, and is much, much faster with the NVMe in place than without. Rest of the tests were done against glusterfs 10.4 glusterd and bricks (up to this point). First tests with:
Mounts for this are done with (except where stated otherwise) the fuse module patched to issue 128KB reads for readdir() via fuse rather than 4KB:
Remote site (~7ms round-trip-time, up to 150Mbps, but we see nowhere near that), with fuse 10.4
Remote site, with fuse 11.0 (unpatched).
It must be noted that a previous run of the above took 238 minutes, far worse than 10.4; the re-run was also made overnight though, so this is definitely a fairer comparison. Remote site, with fuse 11.0, patched, mounted without -o use-readdirp=no -o readdir-optimize=yes
This was NOT expected. So the patch definitely worsens things. Remote site, with fuse 11.0, patched with -o use-readdirp=no -o readdir-optimize=yes (but volume settings unchanged).
Local (<1ms rtt), fuse 10.4:
Local, fuse 11.0 (unpatched):
This already looks very promising, but could be simple variance based on overall load on the rest of the cluster. But since the remote test showed similar improvement I'd say it's likely legit, and percentage wise we'll take a nearly 20% improvement. Local, fuse 11.0 (patched, with -o use-readdirp=no -o readdir-optimize=yes, volume settings unchanged).
Really not expected. The remote test ran concurrently to this and finished in a shorter timeframe on the first run? That makes little to no sense. On the second run (different time of day) it performed worse again than the initial run but more in line with the remote run, which indicates that the latency factor has little to do with the worse performance. After changing the volume options, restarting the bricks too (just to be on the safe side) and remounting the two test hosts, we re-ran. We didn't let this complete, as the overall throughput on the cluster ground to a halt. strace -T -p $(pidof find) showed getdents64() normally taking around 100ms/call, but upwards of 3.5s and even close to 4.5s. Even simple touches of individual files clocked in at 5 to 10s. Conclusion: we're sticking with force-readdirp but without quick-read, using unpatched 11.0. The patch seems to have unintended side effects and I can't vouch for it; some getdents() calls from userspace with find take upwards of 3.5s (measured on the "local" node) in this configuration, where otherwise we'd only see up to ~1.5s. This is especially visible on larger folders (meaning many files in the listing). We're willing to re-test: once all nodes have been upgraded to 11.0 we'll re-patch one of the mount-only nodes, unless there are code changes on the brick side I missed, in which case we'll need to patch both bricks and perform tests on those nodes.
@jkroonza Do you use Slack? Please let me know; if you do, we can discuss on Slack.
No I do not, but you're welcome to find jkroon on Libera.chat. The previous post got double-posted somehow. But yes, for the moment I'm backing out the patch and leaving one client node on glusterfs 11.0, with quick-read disabled.
I will try to catch you tomorrow; please also generate a statedump next time you run the test case.
Sure, I can do that tonight. Where would you like state dumps from? The bricks presumably, and the test node? I'll be able to do the remote node test for you tonight with state dumps, but not the local node (where the patch has been reverted to assist with client load, and we're busy comparing CPU usage between 10.4 and 11.0 for the FUSE process at similar loads, i.e., 25% of clients towards each of four nodes).
It's ok to just take a statedump of the client on the node where you applied the patch.
https://downloads.uls.co.za/glusterdump.10287.dump.1691089919.bz2 (8MB).
Just FYI, you are not able to use the patch completely because you have not disabled readdirp. If you want to test readdir + parallel_lookup completely you have to disable readdirp completely, as I mentioned earlier (gluster v set test dht.force-readdirp off, gluster v set test performance.force-readdirp off). If you want to disable it only for a specific node you can change the client volfile, disable readdirp for the dht and md-cache xlators, and pass the new client volfile when mounting the volume.
fuse.latency.READDIR=AVG:35430425.370853 CNT:52293 TOTAL:1852763233918 MIN:5418066 MAX:19377263456
We did disable that on the previous run. Bad performance doesn't begin to describe it.
You need to enable "performance.md-cache-timeout 60" as well to get an improvement after applying the patch. By default the timeout is very short, so cached metadata expires almost immediately.
md here I assume means metadata? I'm not even fazed by any metadata or calls to stat(). Just consecutive getdents64 calls are worse after this patch compared to before.
Yes, md means the metadata xlator.
getdents64, as far as I can determine, is the one causing problems. If that uses LOOKUP behind the scenes to fill the linux_dirent structures then yes, I care about it; otherwise no, I don't think so. From the straces I've looked at, in general the system calls are fine, but the moment the courier-imapd process starts issuing a sequence of getdents64() system calls (as a result of glibc readdir() calls, where glibc internally uses getdents64 towards the kernel), my observation is that the initial few calls are reasonably fast, and thereafter it slows down. I've patched the kernel now to issue 128KB reads towards glusterfs via FUSE instead of 4KB, and that made a much, much larger difference than the patch here. So combined with disabling quick-read we seem to be (for the time being at least) at an acceptable level of performance. I think what needs to be kept in mind is that DJB has taken measures to avoid the necessity of things like stat() on maildir by encoding important information (like delivery timestamp and file size) straight into the filename. So as long as the maildir isn't corrupt (something for which I've built my own check tools), a stat() call is very seldom required. Even though courier-imapd does perform them from time to time, their performance on average seems reasonable, and although I suspect we could win a small margin overall by caching the metadata here, I don't (currently) think it's particularly relevant, at least not in comparison with the getdents64() calls.
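As an aside on the "information encoded in the filename" point: in the usual maildir/maildir++ convention the name starts with the delivery timestamp and often carries a ,S=<bytes> attribute with the message size, so a directory listing alone can answer size queries. A rough illustration (my own simplified sketch; real maildir names and flags vary):

```c
/* Pull the delivery timestamp and the ,S=<size> hint out of a
 * maildir-style filename, avoiding a stat() on the message file.
 * Simplified sketch; real-world names and flags vary. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static int parse_maildir_name(const char *name, time_t *when, long *size)
{
    char *end;
    *when = (time_t)strtol(name, &end, 10);   /* leading field: delivery time */
    if (end == name)
        return -1;

    const char *s = strstr(name, ",S=");      /* maildir++ size attribute */
    *size = s ? strtol(s + 3, NULL, 10) : -1; /* -1: size not encoded */
    return 0;
}

int main(void)
{
    const char *example = "1691089919.M1234P5678.host,S=52413:2,S";
    time_t when;
    long size;

    if (parse_maildir_name(example, &when, &size) == 0)
        printf("delivered at %lld, size %ld bytes\n", (long long)when, size);
    return 0;
}
```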
Thanks for describing your case in detail. I don't think glibc triggers a lookup operation during getdents. IIUC, in the best case maildir does not trigger a lookup operation, so we don't need to wind a lookup call during readdir and you will get more improvement compared to current readdirp, because there is no problem with readdir at the brick level. The readdir itself is fast; the performance is slow while it does lookups, either along with readdirp or as separate lookups. Let me know if you want a new patch; I can make it configurable so that fuse will trigger a lookup operation during readdir only when the user has enabled it.
The other way is that you can apply your kernel patch (128K buffer size) and disable readdirp, instead of applying any gluster patch, to get the best performance for your current need.
What is the resultant difference between readdir and readdirp (ignoring the pre-emptive lookup call)? The only thing I can imagine is filling in the additional details vs not (filling of d_type), which is provided by pretty much every underlying filesystem we will bump into. Also, given that I can guarantee the pre-conditions of the checks done by courier-imapd by using lstat (courier-imapd basically uses that to verify that no-one put "strategic" symlinks into place, and since our hosts would need to be severely compromised before that can happen it's pretty much a non-concern), I'm fairly certain I can hack out those stat calls in any case, which means that beyond readdir(), which translates into getdents64() to the kernel, the bulk of other calls should be open(). Do you think that the resulting slow getdents64() comes from one of the following:
How can this be confirmed? DISCLAIMER: I'm not intimately familiar with all the concepts under discussion, so to a large degree I'm subject to your guidance here.
AFAIK the linux kernel does not provide any system call specific to readdirp; here, in the glusterfs case, it is supported by fuse. At the gluster level we use the dentry structure struct _gf_dirent { ... } to populate the dentry list. As we know, d_ino/d_off/d_len/d_type/d_name are all returned by the getdents call; others like d_stat are not returned by getdents, so if the user has enabled readdirp (the default) the brick process first prepares a list of dentries in the function posix_fill_readdir and after that calls the function posix_readdirp_fill to fetch the metadata (to fill d_stat). I believe the getdents call does not take much time; the slowness is at the level of posix_readdirp_fill (due to disk lookups) or posix_lookup (in the case of plain readdir). So if the application does not need metadata, then with either plain readdir plus your kernel patch, or an increased buffer size at the gluster level for plain readdir, you can get a good improvement compared to readdirp. To confirm the performance difference you can do one thing: additional lookup calls can be triggered either by the application or by the kernel, and you have to debug which it is.
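For readers following along, the fields named above suggest a dentry record roughly like the one below (a paraphrased sketch based on this description, not a verbatim copy of the glusterfs gf-dirent header):

```c
/* Rough paraphrase of the dentry record described above -- field names
 * follow the comment, not the exact glusterfs header. */
#include <stdint.h>

struct iatt_like {            /* stand-in for gluster's iatt stat block */
    uint64_t ia_size;
    uint32_t ia_uid, ia_gid;
    /* ... remaining stat fields ... */
};

struct gf_dirent_like {
    uint64_t d_ino;           /* returned by getdents */
    uint64_t d_off;           /* returned by getdents */
    uint32_t d_len;           /* returned by getdents */
    uint32_t d_type;          /* returned by getdents */
    struct iatt_like d_stat;  /* NOT returned by getdents: readdirp has the
                                 brick fill this via posix_readdirp_fill() */
    char d_name[];            /* returned by getdents */
};
```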
Hi, sorry, I think the fact that English is not the first language for either of us is making this harder. Let me digest that a bit, but offhand it looks like plain readdir really should be more than adequate for our use case. In some cases however (I don't know which) courier-imapd does perform a sequence of lstat() calls (resulting in fstatat(..., AT_SYMLINK_NOFOLLOW) to the kernel), so I'm guessing this may be why readdirplus overall still performs better for us compared to readdir, since the stat data will already (to some degree) be cached in-kernel. As far as I could determine those lstat()s are purely for the purpose of checking that nobody dropped symbolic links into the maildir folders, so I could remove those (probably by way of additional code to expose this check as a configuration option); this should result in a rather major performance boost, but I would want to check my logic with the upstream project first prior to doing that. For the time being quick-read seems to improve things, as does the larger kernel-to-userspace buffer, and most certainly the NVMe dm-cache also made a HUGE difference.
Ack. You can check with the upstream project first about avoiding the lookup call; then you can use a plain readdir call, otherwise readdirp is ok.
The courier-imap PR for this has been merged, so it won't eliminate stat() calls, but it will reduce them (potentially significantly). Explanations as to WHY it might be happening are also there.
I'm going to close this; it seems the larger readdir buffer on the kernel side did the trick for us for the time being, and the patches here definitely degraded overall performance.
Description of problem:
The reality is that small-file performance is really a many-files performance issue, i.e., when we have a really large number of files in a single folder, or eventually just millions and even billions of files. In our case there are approximately 11m inodes on a single brick; I'm not sure of the total directory count, but it too will be highish, more than the dentry_hashsize table at the very least.
Those are the itable stats on one of the fuse mounts just prior to remounting with a 16m inode_hashsize. I'm not sure what active_size relates to; if it is dentry_hashsize this seems OK.
invalidate_size=0 is usually a good sign that performance should be OK-ish, yet we currently have huge problems w.r.t. large maildir folders. We generally aim to archive mail folders, but some users are refusing, and we feel that whilst 85k messages in a maildir is quite big, it's also not completely unreasonable. In this case it seems, though, that readdir() is NOT the problem; it's more that the courier-pop3d process (the user normally connects using imap) is literally trying to read ALL messages. I'll have to look into courier-pop3d as to WHY this is needed rather than reading the messages as and when needed (the client can then download the 50GB at some trickle or other, but our servers won't be stuck in this high-throughput-required problem).
The next question is how I would go about deploying a mechanism to also be able to tweak the inode_hashtable size for bricks (Similar to #3716).
This is currently on glusterfs 10.4.