small-file performance again #4176

Closed
jkroonza opened this issue Jun 12, 2023 · 42 comments · May be fixed by #4207

@jkroonza
Contributor

Description of problem:

The reality is that small-file performance is really a many-files performance issue, i.e. it shows up when we have a very large number of files in a single folder, or eventually just millions or even billions of files overall. In our case there are approximately 11m inodes on a single brick; I'm not sure of the total directory count, but it too will be fairly high, certainly more than the dentry_hashsize table at the very least.

[xlator.mount.fuse.itable]
xlator.mount.fuse.itable.dentry_hashsize=14057
xlator.mount.fuse.itable.inode_hashsize=1048576
xlator.mount.fuse.itable.name=meta-autoload/inode
xlator.mount.fuse.itable.lru_limit=1048576
xlator.mount.fuse.itable.active_size=5994
xlator.mount.fuse.itable.lru_size=1048575
xlator.mount.fuse.itable.purge_size=0
xlator.mount.fuse.itable.invalidate_size=0

Those are the itable stats on one of the fuse mounts just prior to remounting with a 16m inode_hashsize. I'm not sure what active_size relates to; if it is bounded by dentry_hashsize then this seems OK.

invalidate_size=0 is usually a good sign that performance should be OK-ish, yet we currently have huge problems w.r.t. large maildir folders. We generally aim to archive mail folders, but some users are refusing, and we feel that whilst 85k messages in a maildir is quite big, it's also not completely unreasonable. In this case it seems that readdir() is NOT the problem; it's more that the courier-pop3d process (the user normally connects using IMAP) is literally trying to read ALL messages. I'll have to look into courier-pop3d as to WHY this is needed rather than reading the messages as and when needed (the client can then download the 50GB at some trickle, but our servers won't be stuck in this high-throughput situation).

The next question is how I would go about deploying a mechanism to also be able to tweak the inode_hashtable size for bricks (similar to #3716).

This is currently on glusterfs 10.4.
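
For context, itable stats like the above come out of a client statedump. A minimal sketch of how to regenerate and filter them, assuming the default statedump directory of /var/run/gluster and a single fuse client process for the mail volume (the pgrep pattern is just an example):

# Ask the fuse client to write a statedump, then grep the itable section out of the newest dump
kill -USR1 $(pgrep -f 'glusterfs.*volfile-id=mail')
DUMP=$(ls -t /var/run/gluster/glusterdump.* | head -n1)
grep -A8 'xlator.mount.fuse.itable' "$DUMP"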

@mohit84
Contributor

mohit84 commented Jun 13, 2023

Do you know which file operation is affecting the performance?
Can you please share a statedump so we can analyze which fop's latency is the problem?

@jkroonza
Contributor Author

We've disabled performance monitoring due to it having a negative impact, and we're trying to get every ounce of performance we can. I can definitely re-enable it for relatively short periods. I've unfortunately deleted the state dumps from before the remount. I'm going to let things run for a bit (the day) and then create new state dumps for you.

Is there any way to track lock contention without major performance impact? Specifically I'm wondering about the brick processes, i.e. will enlarging the inode hash table at the brick layer have a similar impact to what we've seen on the fuse mounts?

@mohit84
Contributor

mohit84 commented Jun 13, 2023

What is the volume configuration? You can try perf trace -s -p <brick_pid>; I think it should not hurt performance much.
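
A minimal sketch of how that might be run against one of the bricks, assuming the brick PID is the last column of gluster volume status output and that perf trace prints its summary when interrupted:

# Hypothetical helper: grab this node's brick PID, then collect a ~60s syscall summary
BRICK_PID=$(gluster volume status mail | awk '/server1:\/mnt\/gluster\/mail/ {print $NF}')
timeout -s INT 60 perf trace -s -p "$BRICK_PID"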

@jkroonza
Contributor Author

Hi,

I'll attach perf during a quieter time and re-trigger the process that generally takes too long. This should give us a reasonable idea. In the meantime, the gluster volume info details:

Volume Name: mail
Type: Replicate
Volume ID: 2938a063-f53d-4a1c-a84f-c1406bdc260d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: server1:/mnt/gluster/mail
Brick2: server2:/mnt/gluster/mail
Options Reconfigured:
performance.write-behind: off
performance.write-behind-pass-through: true
client.event-threads: 2
config.client-threads: 2
performance.open-behind-pass-through: true
cluster.locking-scheme: granular
performance.open-behind: off
performance.iot-pass-through: true
config.global-threading: on
config.brick-threads: 64
server.event-threads: 2
cluster.data-self-heal-algorithm: full
performance.io-thread-count: 64
performance.high-prio-threads: 32
performance.normal-prio-threads: 16
performance.low-prio-threads: 8
performance.least-prio-threads: 8
cluster.readdir-optimize: on
cluster.granular-entry-heal: enable
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Mounts happen with:

/usr/sbin/glusterfs --lru-limit=16777216 --inode-table-size=16777216 --invalidate-limit=16 --background-qlen=32 \
    --fuse-mountopts=noatime,nosuid,noexec,nodev --process-name fuse --volfile-server=localhost --volfile-id=mail \
    --fuse-mountopts=noatime,nosuid,noexec,nodev /var/spool/mail

Two nodes mount against localhost, and a handful of others mount against a ucarp address floating between the two servers.

@jkroonza
Contributor Author

OK, so I've managed to ignore this for a while.

iotop today clearly showed close to 64 threads blocking on IO.

Running strace against some of those threads showed that quite a bit of this was fstatat() and readdir*() related calls.

The load averages on the two hosts that host the bricks are consistently quite high, whereas the hosts that merely run fuse mounts are, at this stage, relatively quiet.

For this reason I again believe there is work to be done on the brick side. Looking at a state dump there definitely is a table:

[GlusterFS Handshake]
GlusterFS Handshake.program-number=14398633
GlusterFS Handshake.program-version=2
GlusterFS Handshake.latency.SETVOLUME=AVG:13140723.608696 CNT:23 TOTAL:302236643
 MIN:464592 MAX:81209698
conn.0.bound_xl./mnt/gluster/mail.dentry_hashsize=14057
conn.0.bound_xl./mnt/gluster/mail.inode_hashsize=65536
conn.0.bound_xl./mnt/gluster/mail.name=/mnt/gluster/mail/inode
conn.0.bound_xl./mnt/gluster/mail.lru_limit=16384
conn.0.bound_xl./mnt/gluster/mail.active_size=30743
conn.0.bound_xl./mnt/gluster/mail.lru_size=16384
conn.0.bound_xl./mnt/gluster/mail.purge_size=0
conn.0.bound_xl./mnt/gluster/mail.invalidate_size=0

This is also the ONLY table found in the brick process state dump.

I've now proceeded to set network.inode-lru-limit to 1048576, which almost immediately caused lru_size to increase to ~700k entries and disk IO wait to drop (even though MB/s throughput increased). So that's likely a good thing, but I do think increasing the inode table size would be a good thing too, and likewise the value of dentry_hashsize (the fuse mount will also benefit from this), so I'll try to implement that as well since, if I understand correctly, it should improve readdir and lookup performance. I believe the bulk of our workload is readdir and, to a degree, lookup, since these are the functions maildir is heavy on (it uses readdir to get a listing of emails in a specific folder). The numbers below don't seem quite right, but this is what's reported:

 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 ns       0.00 ns       0.00 ns    10227179077      FORGET
      0.00       0.00 ns       0.00 ns       0.00 ns    11015232186     RELEASE
      0.00       0.00 ns       0.00 ns       0.00 ns      742704822  RELEASEDIR
      0.00   30976.83 ns   22232.00 ns   38055.00 ns              6      STATFS
      0.00   66782.00 ns   40140.00 ns  116525.00 ns             12       FSTAT
      0.00   59375.17 ns   42431.00 ns  130386.00 ns             29    GETXATTR
      0.02 1374560.59 ns  571075.00 ns 2566805.00 ns             17       FSYNC
      0.02   50194.04 ns   34358.00 ns  284471.00 ns            548        STAT
      0.02   18794.05 ns    9713.00 ns  333028.00 ns           1568    FINODELK
      0.04 1446818.43 ns  158354.00 ns 22548617.00 ns             35      RENAME
      0.04   18193.25 ns    9936.00 ns  248276.00 ns           2788     INODELK
      0.04   93263.55 ns   62805.00 ns  204021.00 ns            646     SETATTR
      0.05  103518.74 ns   77591.00 ns  406304.00 ns            748    TRUNCATE
      0.07  127287.13 ns   92311.00 ns 1643713.00 ns            757        LINK
      0.09   13244.59 ns    4720.00 ns  270889.00 ns           9471     ENTRYLK
      0.10   34572.62 ns   23853.00 ns  142482.00 ns           4087     OPENDIR
      0.11  789583.59 ns   24536.00 ns 22792161.00 ns            204        READ
      0.17   84258.60 ns   64184.00 ns  563627.00 ns           2829       WRITE
      0.24  145293.15 ns   98331.00 ns 10638931.00 ns           2317      UNLINK
      0.53  506244.10 ns   73377.00 ns 2156890.00 ns           1496     XATTROP
      0.56  508990.82 ns   55164.00 ns 1637287.00 ns           1568    FXATTROP
      1.10    9374.46 ns    4472.00 ns  175881.00 ns         167341       FLUSH
      1.60 1433413.32 ns  185075.00 ns 52220377.00 ns           1591      CREATE
      5.34   45791.73 ns   31741.00 ns  611212.00 ns         165734        OPEN
     22.00  184206.27 ns   42746.00 ns 1649048844.00 ns         169756      LOOKUP
     67.85 2078239.93 ns   16324.00 ns 319421368.00 ns          46402    READDIRP
 
    Duration: 19930968 seconds
   Data Read: 48786219049248 bytes
Data Written: 6946959191098 bytes

Specifically, I don't get why there are relatively few LOOKUP and READDIRP calls (counter wrap?). Either way, the thing that needs to be targeted here is primarily readdirp performance from the looks of it.

Looking at the code, the way the options are applied on the server side differs quite a bit from the fuse side. Where it's trivial to pass options to inode_table_new() to vary the size, on the server things get a bit more complicated: either the table needs to be replaced with a new one, or a new bucket array needs to be allocated and, under lock, all existing inodes rehashed into it. Something similar will need to happen to modify dentry_hashsize on the fly.

This should not be a frequent operation, or possibly one can rely on restarting the bricks to make the table size and dentry size options take effect there? Either way, which of the three strategies to follow is bugging me a bit at the moment.

Yes, I'm quite happy to sacrifice RAM to reduce disk IO and contention overall. A table size of 2^20 is still "only" 16MB of RAM, a very small price to pay indeed.
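
Back-of-the-envelope numbers for the bucket array alone, assuming each hash bucket is a 16-byte list head on a 64-bit build (that assumption is what gives the 16MB figure above; the inodes themselves are extra), plus the lru-limit change mentioned earlier:

# Hash bucket array cost only, assuming 16 bytes per bucket
echo $(( (1 << 20) * 16 / 1024 / 1024 )) MiB    # 2^20 buckets -> 16 MiB
echo $(( (1 << 24) * 16 / 1024 / 1024 )) MiB    # 2^24 buckets -> 256 MiB

# The brick-side inode LRU change applied above (volume name as used in this issue)
gluster volume set mail network.inode-lru-limit 1048576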

@mohit84
Contributor

mohit84 commented Jul 26, 2023

To improve readdirp performance there is one more thing you can try: instead of using readdirp we can try readdir+parallel_lookup. Currently you are using a 1x2 volume, so readdirp would not fetch more than ~17 entries per attempt, because the fuse readdirp buffer size is 4k and you can't load readdir-ahead to fetch entries in the background (because there is no dht). As we can see here, the average lookup latency is 184us and readdirp is taking 2ms, but if we use readdir+parallel_lookup (with a 128k buffer) it can fetch more entries per call and latency should also come down; the only drawback is that it can generate slightly higher CPU usage. Let me know if you are interested in testing it and I can share a patch with you. The readdir+parallel_lookup would be a configurable option; you can keep using the existing readdirp if you don't want it.
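Rough arithmetic behind those numbers, assuming something in the region of 240 bytes per readdirp entry (gf_dirent header plus iatt plus a typical maildir filename); the per-entry size is an estimate backed out of the ~17 entries per 4 KiB quoted above, not a measured value:

echo $(( 4096 / 17 ))        # ~240 bytes per entry (estimate)
echo $(( 131072 / 240 ))     # ~546 entries per 128 KiB buffer, consistent with the "just under 500" reported later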

@jkroonza
Contributor Author

From what I've read, parallel-readdir is only really helpful in distribute cases, which is why it's not enabled here.

The 4KB vs 128KB (or even larger, say 256KB) buffer sounds potentially much more useful. And yes, at this point in time I'm happy to try more things, so please fire away with patches, even kernel ones if that's where the buffer sits.

@mohit84
Contributor

mohit84 commented Jul 26, 2023

No, the parallel lookup would be helpful in all cases. We can wind a lookup call during readdir_cbk at the fuse xlator instead of waiting for the application/kernel to wind a lookup call. I need some time; I will share the patch before Monday.

@jkroonza
Contributor Author

Thanks.

OK. I've now enabled performance.parallel-readdir again. Since I'm assuming this is a client-side option, I'm guessing I need to remount?

Also, I've just looked at the kernel code (fs/fuse/readdir.c) and it certainly looks like it's allocating a page at a time ... it would be tremendously helpful if this too could be increased to read "up to" 128KB at a time at the kernel level. I can certainly look into this as well, but this concept of a page at a time looks somewhat wired into the way things are cached, so I'm a little worried that it may be difficult. Assuming it's a simple change in fuse_readdir_uncached() to switch the alloc_page() call to alloc_pages() with an order value of 5 (4KiB pages x (2^5 == 32) pages == 128KiB), would the current glusterfs code handle that? (This I can test without affecting production quite trivially.)

Even if you wind an additional readdir() call (which would basically amount to read-ahead for readdir, if I understand your strategy correctly? How does this compare to performance.readdir-ahead?), that would still benefit overall performance, even if reading then happens 32 pages at a time.

I'll work on a patch for increasing dentry_hashsize on the fuse client in the meantime as well (I'm not sure how/where this would help, to be perfectly honest; thinking it through again it's probably just a name => inode cache, so primarily for lookup performance, which, whilst problematic, isn't the most problematic issue).

Guidance on how to implement inode table size and dentry table sizes for the server/brick side would be appreciated.

Please correct my understanding if I go off the rails. I'm not intimately familiar with the glusterfs internals, but I have managed to learn a fair bit over the years, even though every time it feels like a steep curve all over again.

@jkroonza
Contributor Author

Either parallel-readdir or readdir-ahead is causing major problems, a scenario where everything blocks badly on fuse. I've backed out performance.readdir-ahead for the time being to see if this resolves it.

@mohit84
Contributor

mohit84 commented Jul 26, 2023

Here I did not mean parallel-readdir; parallel-readdir can only be used in the dht case, along with readdir-ahead. I meant implementing new logic at the fuse layer with my patch. We don't need any change at the kernel level: we can change the buffer size at the fuse xlator level while winding a readdir call, and in readdir_cbk we can do a lookup for all the entries before transferring the data to the kernel. Fuse will transfer the (readdir) data to the kernel only after receiving all the lookup responses, so in short all the lookups will be in parallel. When a lookup response is received we can take an extra reference on the inode so that it always stays in the active list. We can keep the inode active for 10m; after 10m the inode will be unref'd automatically. So if the application requests metadata it will be served from md-cache and will be fast; otherwise md-cache winds a lookup call to the backend. I will work on this and share a test patch you can try in your environment.

@jkroonza
Contributor Author

The kernel patch that I submitted to increase the kernel readdir buffer via fuse from 4KB to 128KB is already showing promise. Instead of the application's readdir() getting 14 entries on average, it now gets just under 500 per system call, which is a HUGE improvement already. The overall system call time is definitely longer though, so your patch will still be required.
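
One way to see this from userspace (a sketch; recent strace versions annotate getdents64 results with an entry count, and the maildir path here is only an example):

# Time each getdents64 call and count how many are needed for one listing
strace -f -T -e trace=getdents64 -o /tmp/getdents.log \
    find /var/spool/mail/exampleuser/Maildir/cur -maxdepth 1 > /dev/null
grep -c getdents64 /tmp/getdents.log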

@mohit84
Contributor

mohit84 commented Jul 26, 2023

Can you please share the patch link, just for reference?

@jkroonza
Contributor Author

@mykaul
Contributor

mykaul commented Jul 27, 2023

Release 11 has many (small) improvements around inode.c, the readdir(p) code and memory allocations (mainly locality changes and fewer calls to malloc/free). I wonder whether it'll make a difference for your use case. Of course, the above change looks much more critical.

@jkroonza
Contributor Author

At this stage I'm inclined to think it can't get worse. If I upgrade the op-version, even if it implies downtime, is it possible to downgrade again?

mohit84 added a commit to mohit84/glusterfs that referenced this issue Jul 28, 2023
There are multiple readdir improvements with this patch
for fuse.
1) Set the buffer size to 128KB when winding a readdir call from fuse
2) Wind a lookup call for all entries in parallel from fuse_readdir_cbk
3) Keep the inode in the active list after taking an extra reference for
   10m (600s) so that the next lookup is served by md-cache
4) Take the reference only while table->active_size is less than 1M

Fixes: gluster#4176
Change-Id: Ic9c75799883d94ca473c230e6213cbf00b383684
Signed-off-by: Mohit Agrawal <[email protected]>
@mohit84
Contributor

mohit84 commented Jul 28, 2023

@jkroonza Please test the patch in your test environment first; then you can apply it in your production environment.

To enable readdir you first have to disable readdirp. To disable readdirp, apply the settings below on the volume:

gluster v set test dht.force-readdirp off
gluster v set test performance.force-readdirp off

Quick-read also needs to be disabled, otherwise during lookup it will try to read the content of the file and hurt lookup performance:

gluster v set test performance.quick-read off

The volume needs to be mounted like this, passing two arguments (use-readdirp=no to disable readdirp and readdir-optimize=yes to enable readdir optimization):

mount -t glusterfs <ip_address>:/<vol_name> <mount_point> -ouse-readdirp=no -o readdir-optimize=yes

@jkroonza
Contributor Author

@mohit84 why should readdirp be off? Surely the whole point of readdirp is to improve performance? BTW, with the kernel patch we're now getting IO queue lengths of >40 as measured with iostat -dmx, so it looks like just making that readdir IO buffer from the kernel to fuse bigger has made a huge difference already. But the system call response times are still highly variable.

Are any changes required on the server/brick side?

Current settings for those mentioned:

dht.force-readdirp                       on (DEFAULT)                           
performance.force-readdirp               true (DEFAULT)                         
performance.quick-read                   on                                     

So they all need to change. Will quick-read make a difference in general anyway? What is the effect of quick-read? (I can't recall offhand why we set it.)

@mohit84
Contributor

mohit84 commented Jul 28, 2023

We can use either readdir or readdirp to list a directory, not both. It depends on application usage (ls vs ll). If the application does a lookup after the readdir operation then readdirp, or the patch (readdir+parallel_lookup), is useful; but if the application does only readdir with no inode metadata access then plain readdir is fine. By default readdirp is enabled. The current plain readdir does not perform well because it winds lookup operations sequentially; that's why readdirp is the default. After this patch it winds lookups in parallel during readdir instead of waiting for the kernel/application, so when the application does a lookup the fop is served from md-cache.

There are certain cases where readdir+parallel_lookup outperforms readdirp, e.g. when the application winds listdir (ll) again and again, because the new patch caches entries in md-cache for 10m. There is no specific change required on the server side, only a small change in the upcall xlator that is applicable for cache-invalidation. You can avoid it if you don't want to apply a patch on the server.

quick-read improves application performance if the app reads files frequently: it fetches the first 64k during lookup and saves it in a cache, but at the same time it hurts lookup performance. If the application is not read-intensive then you can disable it.

@jkroonza
Contributor Author

Thanks. I've disabled quick-read for now, as I think for the majority of our cases it's irrelevant. More and more I think that readdir is, for our use case, the single most important operation.

@jkroonza
Contributor Author

jkroonza commented Aug 3, 2023

We've added 1TB of NVMe dm-cache (smq, writethrough) in front of the RAID controller over the weekend, specifically on the gluster LV which is 8TB in size. Cache lines are 1MB in size. The cache populated over the weekend and we're seeing a >90% cache hit ratio on reads. I mention this because this change by itself has made a massive impact. The kernel patch I cooked last week got us to the point where the RAID controller became (from what we could see) the single largest bottleneck, which is what prompted the hardware change. And this only comes now because it took me MUCH longer than expected to get the figures below (I had some issues upgrading the node in the DC I could use for testing without affecting the rest of the cluster).

Performance stats:

Directly on the brick:

Filesystem                 Inodes    IUsed     IFree IUse% Mounted on
/dev/mapper/lvm-gluster 402653184 12408250 390244934    4% /mnt/gluster

# time find /mnt/gluster/mail/* >/dev/null
real    1m1.302s
user    0m9.710s
sys     0m10.671s

So that's theoretically the fastest we can possibly go, and it is much, much faster with the NVMe in place than without. The rest of the tests were done against glusterfs 10.4 glusterd and bricks (up to this point).

First tests with:

performance.force-readdirp               true (DEFAULT)                         
dht.force-readdirp                       on (DEFAULT)                           

Mounts for this are done (except where stated otherwise) with the fuse kernel module patched to issue 128KB reads for readdir() rather than 4KB:

/usr/sbin/glusterfs --lru-limit=1048576 --inode-table-size=65536 --invalidate-limit=16 --background-qlen=32 --fuse-mountopts=noatime,nodev,nosuid,noexec --process-name fuse --volfile-server=mail.iewc.co.za --volfile-id=mail --fuse-mountopts=noatime,nodev,nosuid,noexec /mnt/t

Remote site (~7ms round-trip-time, up to 150Mbps, but we see nowhere near that), with fuse 10.4

real    165m8.606s
user    0m16.258s
sys     0m53.531s

Remote site, with fuse 11.0 (unpatched).

real    143m54.534s
user    0m16.642s
sys 0m47.618s

It must be noted that a previous run of the above took 238 minutes, far worse than 10.4; the re-run was made overnight though, so this is definitely a fairer comparison.

Remote site, with fuse 11.0, patched without -o use-readdirp=no -o readdir-optimize=yes

real	254m44.046s
user	0m8.115s
sys	0m20.620s

This was NOT expected. So the patch definitely worsens things.

Remote site, with fuse 11.0, patched with -o use-readdirp=no -o readdir-optimize=yes (but volume settings unchanged).

real	245m15.254s
user	0m8.498s
sys	0m10.766s

real	226m18.885s
user	0m18.566s
sys	0m25.220s

Local (<1ms rtt), fuse 10.4:

real    78m12.205s
user    0m6.047s
sys     0m22.225s

Local, fuse 11.0 (unpatched):

real    63m44.914s
user    0m7.585s
sys     0m32.102s

This already looks very promising, but it could be simple variance based on overall load on the rest of the cluster. Since the remote test showed a similar improvement I'd say it's likely legit, and percentage-wise we'll take a nearly 20% improvement.

Local, fuse 11.0 (patched, with -o use-readdirp=no -o readdir-optimize=yes, volume settings unchanged).

real    359m45.421s
user    0m8.685s
sys     0m8.027s

real    217m2.902s
user    0m6.843s
sys     0m5.482s

Really not expected. The remote test ran concurrently with this one and finished in a shorter timeframe on the first run? That makes little to no sense. On the second run (at a different time of day) it again performed worse than the unpatched run, but more in line with the remote runs, which indicates that the latency factor has little to do with the worse performance.

After changing the volume options, restarting the bricks too (just to be on the safe side) and remounting the two test hosts, we re-ran:

We didn't let this complete as the overall throughput on the cluster ground to a halt. strace -T -p $(pidof find) showed getdents64() normally taking around 100ms/call, but going upwards of 3.5s and even close to 4.5s. Even simple touch commands on individual files clocked in at 5 to 10s.

Conclusion:

We're sticking with force-readdirp but without quick-read, using unpatched 11.0. The patch seems to have unintended side effects and I can't vouch for it: some getdents() calls from userspace with find take upwards of 3.5s (measured on the "local" node) in this configuration, where otherwise we'd only see up to ~1.5s. This is especially visible on larger folders (meaning many files in the listing).

We're willing to re-test: once all nodes have been upgraded to 11.0 we'll re-patch one of the mount-only nodes, unless there are code changes on the brick side that I missed, in which case we'll need to patch both bricks and perform the tests on those nodes.

@mohit84
Contributor

mohit84 commented Aug 3, 2023

@jkroonza Do you use Slack? Please let me know if you do; we can discuss there.

@jkroonza
Contributor Author

jkroonza commented Aug 3, 2023

No I do not, but you're welcome to find jkroon on Libera.chat.

My previous post got double-posted somehow. But yeah, for the moment I'm backing out the patch and leaving one client node on glusterfs 11.0, with quick-read disabled.

@mohit84
Contributor

mohit84 commented Aug 3, 2023

I will try to catch you tomorrow. Please also generate a statedump next time you run a test case.

@jkroonza
Contributor Author

jkroonza commented Aug 3, 2023

Sure, I can do that tonight. Where would you like state dumps from? The bricks presumably, and the test node? I'll be able to do the remote node test for you tonight with state dumps, but not the local node (where the patch has been reverted to assist with client load, and we're busy comparing CPU usage between 10.4 and 11.0 for the FUSE process at similar loads, i.e. 25% of clients towards each of four nodes).

@mohit84
Contributor

mohit84 commented Aug 3, 2023

It's OK if you can take a statedump of the client from the node where you applied the patch.
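
For the record, a sketch of how those statedumps can be produced (brick dumps via the CLI, the patched fuse client via SIGUSR1; dump files normally land under /var/run/gluster, and the pgrep pattern below is just an example):

# Brick statedumps for the volume (run on a server node)
gluster volume statedump mail

# Fuse client statedump on the patched test node
kill -USR1 $(pgrep -f 'glusterfs.*volfile-id=mail')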

@jkroonza
Contributor Author

jkroonza commented Aug 3, 2023

/usr/sbin/glusterfs --use-readdirp=no --readdir-optimize=yes --lru-limit=16777216 --inode-table-size=16777216 --invalidate-limit=16 --background-qlen=32 --fuse-mountopts=noatime,nodev,nosuid,noexec --process-name fuse --volfile-server=mail.iewc.co.za --volfile-id=mail --fuse-mountopts=noatime,nodev,nosuid,noexec /mnt/t

https://downloads.uls.co.za/glusterdump.10287.dump.1691089919.bz2 (8MB).

@mohit84
Contributor

mohit84 commented Aug 4, 2023

Just FYI, you are not able to use the patch fully because you have not disabled readdirp. As we can see, readdir is enabled only for the fuse/md-cache xlators; below that it falls back to readdirp. The difference between readdir and readdirp is that readdirp fetches inode metadata along with the entry names, whereas readdir only fetches the list of directory entries, not the metadata. It is not recommended to use both because there is no benefit; here fuse is trying to wind parallel LOOKUP calls during readdir even though the other xlators are winding readdirp.

If you want to test readdir + parallel_lookup properly you have to disable readdirp completely, as I mentioned earlier (gluster v set test dht.force-readdirp off, gluster v set test performance.force-readdirp off). If you want to disable it only for a specific node you can change the client volfile, disable readdirp for the dht and md-cache xlators, and pass the new client volfile when mounting the volume.

fuse.latency.READDIR=AVG:35430425.370853 CNT:52293 TOTAL:1852763233918 MIN:5418066 MAX:19377263456
mail-md-cache.latency.READDIR=AVG:35422274.403553 CNT:52293 TOTAL:1852336995385 MIN:5411776 MAX:19377254619
mail-replicate-0.latency.READDIRP=AVG:35376671.393016 CNT:52293 TOTAL:1849952277155 MIN:5397890 MAX:19376833365
mail-client-1.latency.READDIRP=AVG:34672443.026482 CNT:26433 TOTAL:916496686519 MIN:5387469 MAX:9580046160
mail-client-0.latency.READDIRP=AVG:36058550.595244 CNT:25860 TOTAL:932474118393 MIN:5442254 MAX:19376795028
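
Those per-xlator latency lines can be pulled straight out of the dump; for example, against the dump file linked earlier in this issue:

bzgrep -E 'latency\.READDIRP?=' glusterdump.10287.dump.1691089919.bz2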

@jkroonza
Contributor Author

jkroonza commented Aug 4, 2023

We did disable that on the previous run.

Bad performance doesn't begin to describe it.

@mohit84
Contributor

mohit84 commented Aug 5, 2023

You also need to set "performance.md-cache-timeout 60" to get an improvement after applying the patch. By default md-cache-timeout is 1s; that's why metadata is not served by md-cache even though the inode is kept active by the fuse xlator after applying the patch. In the default case the metadata is served from the cache only if the application accesses it within 1s; after configuring 60 you can get more improvement because it will be served from the cache. If you want to increase it beyond 60 you have to enable the cache-invalidation feature option, then you can increase the value further.
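
A sketch of the corresponding settings for this volume (the 60s value is what is suggested above; the 600s invalidation timeout is only an example, not something prescribed in this thread):

gluster volume set mail performance.md-cache-timeout 60

# Only needed if you want to go beyond 60s
gluster volume set mail features.cache-invalidation on
gluster volume set mail features.cache-invalidation-timeout 600
gluster volume set mail performance.cache-invalidation on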

@jkroonza
Contributor Author

jkroonza commented Aug 7, 2023

md here I assume means metadata?

I'm not even fazed by metadata or calls to stat(). It's just that consecutive getdents64 calls are worse after this patch compared to before.

@mohit84
Contributor

mohit84 commented Aug 9, 2023

Yes, md means the metadata (md-cache) xlator.
Are you interested only in the getdents64 calls and not in lookup?

@jkroonza
Contributor Author

jkroonza commented Aug 9, 2023

getdents64, as far as I can determine, is the one causing problems. If LOOKUP is used behind the scenes to fill the linux_dirent structures then yes, I care about it; otherwise I don't think so.

From the straces I've looked at, the system calls are in general fine, but the moment the courier-imapd process starts issuing a sequence of getdents64() system calls (as a result of glibc readdir() calls, where glibc internally uses getdents64 towards the kernel), my observation is that the initial few calls are reasonably fast and thereafter it slows down.

I've now patched the kernel to issue 128KB reads towards glusterfs via FUSE instead of 4KB, and that made a much, much larger difference than the patch here.

So combined with disabling quick-read we seem to be (for the time being at least) at an acceptable level of performance.

I think what needs to be kept in mind is that DJB took measures to avoid the need for things like stat() on maildir by encoding important information (like the delivery timestamp and file size) straight into the filename. So as long as the maildir isn't corrupt (something for which I've built my own check tools), a stat() call is very seldom required. Even though courier-imapd does perform them from time to time, their performance on average seems reasonable, and while I suspect we could win a small margin overall by caching the metadata here, I don't (currently) think it's particularly relevant, at least not in comparison with the getdents64() calls.

@mohit84
Contributor

mohit84 commented Aug 9, 2023

Thanks for describing your case in detail. I don't think glibc triggers a lookup operation during getdents. IIUC, in the best case maildir does not trigger a lookup operation, so we don't need to wind a lookup call during readdir, and you would get more improvement compared to the current readdirp because there is no problem with readdir at the brick level. The readdir itself is fast; performance is slow when it does a lookup along with readdirp, or a separate lookup. Let me know if you want a new patch; I can make it configurable so that fuse triggers a lookup operation during readdir only when the user has enabled it.

@mohit84
Contributor

mohit84 commented Aug 9, 2023

The other way is to apply your kernel patch (128K buffer size) and disable readdirp, instead of applying any gluster patch, to get the best performance for your current needs.

@jkroonza
Contributor Author

jkroonza commented Aug 9, 2023

What is the resultant difference between readdir and readdirp (ignoring the pre-emptive lookup call)?

The only thing I can imagine is filling in the additional details vs not (the filling of d_type), which on every filesystem I've worked with recently is provided by pretty much every underlying filesystem we will bump into?

Also, given that I can guarantee the pre-conditions of the checks courier-imapd does using lstat (courier-imapd basically uses it to verify that no-one has put "strategic" symlinks into place, and since our hosts would need to be severely compromised before that can happen it's pretty much a non-concern), I'm fairly certain I can hack out those stat calls in any case, which means that beyond readdir(), which translates into getdents64() to the kernel, the bulk of the other calls should be open().

Do you think that the resulting slow getdents64() comes from one of the following:

  1. the additionally wound LOOKUP calls?
  2. the brick purging the opened inode from the inode table and thus having to re-open and restart the readdir() from the beginning?
  3. a combination of both?

How can this be confirmed?

DISCLAIMER: I'm not intimately familiar with all the concepts under discussion, so to a large degree I'm subject to your guidance here.

@mohit84
Contributor

mohit84 commented Aug 13, 2023

AFAIK the linux kernel does not provide any system call specific to readdirp; here in this case (glusterfs) it is supported by fuse. At the gluster level we use the dentry structure below to populate the dentry list:

struct _gf_dirent {
    union {
        struct list_head list;
        struct {
            struct _gf_dirent *next;
            struct _gf_dirent *prev;
        };
    };
    uint64_t d_ino;
    uint64_t d_off;
    uint32_t d_len;
    uint32_t d_type;
    struct iatt d_stat;
    dict_t *dict;
    inode_t *inode;
    char d_name[];
};

As we know, d_ino/d_off/d_len/d_type/d_name are all returned by the getdents call, while others like d_stat are not. So if the user has enabled readdirp (the default), the brick process first prepares a list of dentries in posix_fill_readdir and then calls posix_readdirp_fill to fetch the metadata (to fill d_stat) and xattrs from the backend. In the case of readdir it (the brick process) fetches only the dentries; it does not call posix_readdirp_fill.

I believe the getdents call itself does not take much time; the slowness is at the level of posix_readdirp_fill (due to the disk lookup) or posix_lookup (in the case of plain readdir). So if the application does not need metadata, then either plain readdir with your kernel patch, or an increased buffer size at the gluster level for plain readdir, should give a good improvement compared to readdirp.

To confirm the performance difference you can do one thing: apply your kernel patch and trigger both operations, readdir and readdirp. In the case of readdir the brick process calls only getdents; in the case of readdirp it calls getdents + lookup (sequentially) to fetch the metadata. You should see the difference. Disabling readdirp is slightly tricky in gluster: you have to disable it at the dht/md-cache/fuse level, otherwise you will not be able to disable it completely.

An additional lookup call can be triggered either by the application or by the kernel; you would have to debug it.
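
One way to quantify the difference is the volume profiler that produced the fop latency table earlier in this issue; a sketch (profiling does add some overhead, as noted above):

gluster volume profile mail start
# ... run the same directory listing once with readdirp enabled, once with it disabled ...
gluster volume profile mail info      # compare READDIR vs READDIRP / LOOKUP latencies per brick
gluster volume profile mail stop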

@jkroonza
Contributor Author

Hi,

Sorry, I think the fact that English is not the first language for either of us is making this harder. Let me digest that a bit, but offhand it looks like plain readdir really should be more than adequate for our use case.

In some cases however (I don't know which) courier-imapd does perform a sequence of lstat() calls (resulting in fstatat(..., AT_SYMLINK_NOFOLLOW) to the kernel), so I'm guessing this may be why readdirplus overall still performs better for us compared to readdir, since the stat data will already (to some degree) be cached in-kernel. As far as I could determine those lstat()s are purely for the purpose of checking that nobody dropped symbolic links into the maildir folders, so I could remove them (probably by way of additional code to expose this check as a configuration option). This should result in a rather major performance boost, but I would want to check my logic with the upstream project first prior to doing that.

For the time being quick-read seems to improve things, as does the larger kernel-to-userspace buffer, and most certainly the NVMe dm-cache also made a HUGE difference.

@mohit84
Contributor

mohit84 commented Aug 14, 2023

Ack. You can first check with the upstream project about avoiding the lookup calls; then you can use plain readdir, otherwise readdirp is OK.

@jkroonza
Contributor Author

The courier-imap PR for this has been merged, so it won't eliminate stat() calls, but it will reduce them (potentially significantly):

svarshavchik/courier-libs#33

Explanations as to WHY it might be happening are also there.

@jkroonza
Contributor Author

jkroonza commented Sep 6, 2023

I'm going to close this; it seems the larger readdir buffer on the kernel side did the trick for us for the time being, and the patches here definitely degraded overall performance.

jkroonza closed this as completed Sep 6, 2023