content-sqlite: preallocate database space #6169
I was thinking about this last night and my initial intuition was that this kind of trick wouldn't work. I thought that most modern databases constantly append new data, and the only way to reclaim disk space is by "compacting"/"defragmenting"; in sqlite that appears to be via VACUUM (https://www.sqlite.org/lang_vacuum.html). And even if we did this, we might be racing against whatever is filling up the disk (e.g. errant logging). But it's better than nothing.

The hack above seems to work around this by pre-allocating an entire database table and then dropping it. So perhaps this will work, but it seems to suggest we need to pre-allocate the entire database table before creating it the first time? Perhaps this idea might work if we pre-allocate critical "checkpoint" data and similar things in a separate database table from the core KVS? (I guess this would be via some other backing store ... or maybe a way to load the backing store module twice?)

Nevertheless, it's more than worthwhile to experiment with this idea and see.
We should do an experiment, but based on those stack overflow comments, it seems like sqlite must reuse space within the db file, and vacuum just shrinks the file to not contain any free space.
I was thinking this would be done up front when flux starts. The sysadmins would configure the amount of space to preallocate. Same basic idea as a partition, but apparently a partition is more of a pain for the sysadmins.
So the first test didn't seem to work. Created a 5M tmpfs.
So we got about a 20K database file. Let's do the trick listed in the stackoverflow post.
Ok good, now the database is listed as about 1 meg. Restart the instance with this statedir, and fill up the statedir with some kruft via a script I wrote.
So if sqlite is re-using the dead space, I should be able to run a job. But it didn't work.
My suspicion is that this issue is because we use the write-ahead log. There may technically be space in the actual database that can be re-used, but there is no space for the write-ahead log itself. Further playing around / investigation to be done. As an aside, I tried using VACUUM, and that did do the trick to free up some space.
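For reference, running VACUUM programmatically looks roughly like this through the sqlite3 C API. This is just a sketch with a made-up function name, not the actual content-sqlite code:

```c
#include <stdio.h>
#include <sqlite3.h>

/* Sketch: shrink a sqlite database file by rewriting it without free pages.
 * Note VACUUM copies the live data into a temporary file before swapping it
 * in, so it needs free space to run -- part of why it may not help once the
 * disk is already full. */
int vacuum_db (const char *path)
{
    sqlite3 *db;
    char *errmsg = NULL;
    int rc = 0;

    if (sqlite3_open (path, &db) != SQLITE_OK) {
        sqlite3_close (db);
        return -1;
    }
    if (sqlite3_exec (db, "VACUUM;", NULL, NULL, &errmsg) != SQLITE_OK) {
        fprintf (stderr, "VACUUM failed: %s\n", errmsg);
        sqlite3_free (errmsg);
        rc = -1;
    }
    sqlite3_close (db);
    return rc;
}
```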
Ouch, I forgot about the separate WAL file.

```
$ mkdir mnt
$ dd if=/dev/zero of=loopbackfile.img bs=100M count=10
$ mkfs.ext4 loopbackfile.img
$ unshare -rm bash
# fuse2fs loopbackfile.img mnt
```

The trick would be we'd have to both do the mount and launch the broker under the same mount namespace.
Confirmed that turning journaling off (`PRAGMA journal_mode=OFF`) avoids the problem of having no space for the WAL.
@trws I assume the goal of using unshare is to hide the mount from other users. Is it strictly necessary? Especially on rank 0 (i.e. the mgmt node that normal users aren't on)?
Let's not do a loopback mount. The fact that our admins are inexplicably resistant to cordoning off space on elcapi doesn't seem like justification for adding this to Flux, and it would create complications for access by anything other than the broker, e.g. during recovery or backup. Plus it feels like we're taking a "no" from the admins and turning it into a "yes" with jedi mind tricks. They control the flux config and the fstab. If I were an admin, I would summarily dismiss this option and either create the fstab entry or not.
That's a side-effect, but no, that's not the reason. It's there to create a mount namespace so that we can mount the fuse filesystem without admin access. It makes it completely doable as just the flux user, in the systemd unit or a script or similar. To your point @garlick, it's clearly not at all ideal, I do get that. If we could grab a library and use it to just open a regular file as a backing store, that would be a whole lot more satisfying. I'm just trying to throw out whatever I can think of in hopes maybe we'll come up with something better. 😕 There are such things, but I don't know if there's a good way to hook them into sqlite. Maybe that's something to look at: an alternate storage backend for sqlite that would let us just point it at a pre-allocated file? Will do some digging.
Gotcha. We might try again asking for a proper file system and avoid expending too much effort here.
A random idea that crossed my mind, so I'm just going to throw it out there so I don't forget (super related to #6154). The pre-allocate trick might be something we can use for "desperate" and "important" situations, e.g. we're tearing down and trying to checkpoint final rootrefs. In those desperate situations, we could just turn off the write-ahead log, etc. We could perhaps just allocate a few megs in the sqlite database for these desperate situations and delete it if needed.
Oh that's a good idea! Couldn't we just pre-allocate whatever large amount of space we want, then if we ever get ENOSPC, turn off the WAL and try again?
Oh that's a fair point, I didn't even think of that. If we hit ENOSPC, we could just turn off WAL across the board. I guess in my mind the risks associated with not having WAL weren't worth it in the general case; we only want to use it in "desperate times".
The first hit that came up was this. Pluggable sqlite backend store is ... a KVS! Yow. This is what it feels like when you realize you might have one too many abstraction layers.
Hmmmm. If we pre-allocate, is it safe if we turn off WAL and set synchronous to "FULL"? Going by https://www.sqlite.org/pragma.html, I'm thinking it is ... I can't convince myself otherwise right now.

Edit: in other words, when we reach this ENOSPC bad state, making sqlite slower is an acceptable trade-off to ensure we continue operations.

Edit 2: Hmmmm, maybe not as safe as we'd like. journal_mode = OFF is more risky. Setting journaling to PERSIST may be possible? I'm not sure if we could "pre-allocate" a sqlite journal file so that it is always there. Or there could be a "chain" of riskiness as we get more and more desperate.
Yeah, that's kind of what I meant. I have no idea if the transition is safe. Might have to close the database (which I assume would fully process the WAL) and reopen it again with different pragmas?
Yeah, I think we have to close & re-open with new pragmas. Which I don't think is a big deal, since sqlite is effectively disabled by the time we would try this anyways. i.e. if we can't re-open with new pragmas, we're not really any worse off than we were before.
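A rough sketch of what that close-and-reopen fallback could look like, assuming the degraded mode discussed here (journal_mode=OFF plus synchronous=FULL). The function name and error handling are illustrative, not actual content-sqlite code:

```c
#include <sqlite3.h>

/* Illustrative only: after an ENOSPC/SQLITE_FULL error, close the database
 * (which should checkpoint and remove the WAL) and reopen it with no
 * journal, so subsequent writes only need space inside the main db file. */
int reopen_degraded (sqlite3 **dbp, const char *path)
{
    sqlite3_close (*dbp);
    *dbp = NULL;
    if (sqlite3_open (path, dbp) != SQLITE_OK)
        return -1;
    /* No rollback journal or WAL: riskier on crash, but no side files. */
    if (sqlite3_exec (*dbp, "PRAGMA journal_mode=OFF;",
                      NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    /* FULL forces an fsync at each commit to claw back some durability. */
    if (sqlite3_exec (*dbp, "PRAGMA synchronous=FULL;",
                      NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    return 0;
}
```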
Well if we have space reserved, we should be OK I would think! We should try it.
Agreed, some experimental work with these ideas is worthwhile. Gonna try some stuff out.
To your point above @garlick, the sqlite4 default pluggable backend is now a log-structured merge tree (a kvs, basically) that actually uses some of the WAL methods even outside of WAL mode. Might be another thing to consider, especially if its side files are a stable size.
As an initial experiment, I did the following.

So I think in principle this can work. It's going to be a lot of work to get this working in the actual content-sqlite, but I'll start with an experimental build. As an aside, using PERSIST journal mode didn't work, so it's going to have to be "OFF" + "FULL".
I got an experimental "preallocate" working w/ content-sqlite. First things first, I wanted to see if this impacts performance in a meaningful way. Initially it appeared there was a small performance dip. In my docker container I ran a job throughput test.

Without pre-allocation I get an average of 101.7 jobs/s; when I pre-allocate about ~1G of space, an average of 95.7 jobs/s. So around a 5.8% dip. Granted this is not the best test, running in a docker container on my laptop. Then I tried 32K jobs instead of 16K to try and get a better averaging over time. With that, the pre-allocate run is a bit faster (66.3 vs 64.1 jobs/s). So the minor variance appears to be within the acceptable error range of what we're expecting and shouldn't scare us away from continuing with the pre-allocate work. (Side note: I tried 65536 jobs too, but it was so slow I gave up; I'd have to pre-allocate 2G of space instead of 1G.)

As an aside, just for experimental purposes, I tried running this test w/ journal_mode "OFF" + synchronous "FULL" just to see how much performance impact it has (note: not preallocating).
That's a nice chunk below the other results (~18%) and outside of our expected "error range".

Edit: hmmmmm, at 64K jobs, pre-allocate was a lot slower. Perhaps I need to test on some LC machines that are beefier.

Edit 2: at 32K jobs, performance is identical on corona.
Just posted a WIP in #6217, but I wanted to bring up the following important thing for discussion. Pre-allocation will only work if journaling is not used. The code in #6217 will disable journaling if it is currently enabled and ENOSPC is hit. Per previous comments, that's a healthy performance hit; 10-20% is a good guesstimate on lost job throughput. So we'd want to re-enable journaling once the bad actor filling up the disk is corrected and disk space is available again. Brainstorming ... maybe there's something more clever.
Maybe
Ahhh, that's a good idea. I didn't think of doing it that way. I guess in my mind the question is how to inform the admins that something in flux is running in a bit of a "degraded mode" and needs attention, thus my thinking above.

Edit: hmmm, I guess this solution would require admins to configure journal_mode and synchronous mode in a content-sqlite.toml file. Which is not super-wonderful, but is not horrible, especially if they are pre-allocating space anyways.
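Purely as a thought experiment, the "undo the degraded mode" idea could look something like the sketch below: poll free space with statvfs(3) and flip journal_mode back to WAL once there is headroom again. The threshold, the polling hook, and whether switching modes on a live connection is safe are all assumptions to be verified, not anything flux does today.

```c
#include <sys/statvfs.h>
#include <sqlite3.h>

/* Hypothetical: re-enable WAL once the statedir filesystem has headroom.
 * Returns 1 if journaling was restored, 0 if still too tight, -1 on error. */
int maybe_restore_wal (sqlite3 *db, const char *statedir,
                       unsigned long min_free_mb)
{
    struct statvfs st;
    unsigned long free_mb;

    if (statvfs (statedir, &st) < 0)
        return -1;
    free_mb = (unsigned long)st.f_bavail * st.f_frsize / (1024 * 1024);
    if (free_mb < min_free_mb)
        return 0;               /* stay in degraded (no-journal) mode */
    if (sqlite3_exec (db, "PRAGMA journal_mode=WAL;",
                      NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    if (sqlite3_exec (db, "PRAGMA synchronous=NORMAL;",
                      NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    return 1;
}
```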
So, wanting to test how fast preallocate was going to be once it's used in production, I preallocated about 50g and stuck some timings in rc1 to get some estimates. This is in my fluxorama image on my laptop.

Eek, that's sorta bad. On average we're looking at around 6-8 minutes. On corona, I got 6-11 minutes, soooo not great. I'm not sure if this is a deal breaker or not. I was going to try smaller "block sizes" to see if they have any effect. But perhaps we need to look at alternate ideas. One could be to pre-allocate while flux is running vs. on initialization.

Update: doh! 64 meg blocks are even worse, generally speaking 10+ minutes on my laptop.
I haven't really looked at the code closely, so you might already be doing this, but:
Yup (well, `sqlite3_bind_zeroblob()`, which is hopefully identical).

Yup.

Ahhh, good guess.

Edit: on corona, times are hovering around 5-6 mins now ... a bit better.
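For anyone following along, the chunked preallocate-and-drop being discussed looks roughly like this with `sqlite3_bind_zeroblob()`. The table name, chunk size, and error handling here are made up for illustration; the real values are exactly what's being tuned above.

```c
#include <sqlite3.h>

/* Sketch: reserve roughly `size` bytes inside the db file by inserting
 * zero-filled blobs into a scratch table, then dropping it so the freed
 * pages land on sqlite's freelist for later reuse. */
int preallocate (sqlite3 *db, sqlite3_int64 size)
{
    const int chunk = 32 * 1024 * 1024;     /* 32M per insert (a guess) */
    sqlite3_stmt *stmt = NULL;
    sqlite3_int64 n;
    int rc = -1;

    if (sqlite3_exec (db, "CREATE TABLE IF NOT EXISTS spacer (b BLOB);",
                      NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    if (sqlite3_prepare_v2 (db, "INSERT INTO spacer (b) VALUES (?);",
                            -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    for (n = 0; n < size; n += chunk) {
        /* the zeroblob is streamed by sqlite, not materialized in memory */
        if (sqlite3_bind_zeroblob (stmt, 1, chunk) != SQLITE_OK
            || sqlite3_step (stmt) != SQLITE_DONE)
            goto done;
        sqlite3_reset (stmt);
    }
    if (sqlite3_exec (db, "DROP TABLE spacer;", NULL, NULL, NULL) != SQLITE_OK)
        goto done;
    rc = 0;
done:
    sqlite3_finalize (stmt);
    return rc;
}
```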
So just brainstorming some preallocation ideas that would be faster.
New ideas
New idea2
Some extra testing: increasing the sqlite page size from its default 4K to 64K improved things a tad (perhaps 6 mins down to 5 mins on corona). Unfortunately 64K is the max page size for sqlite.

Collapsing all writes into a single transaction didn't seem to do much, which in hindsight makes sense: 50g is only on the order of 50-75 writes of 800-900 megs each, not enough to make a huge impact.

Tried increasing the cache size, thinking maybe that'll help with whatever indexing or whatever sqlite does internally; it didn't seem to do much. (Side note: 2048 is the number of pages, so that's 2048*64K for the cache size.)

Thought maybe using only prepared statements w/ pre-defined zeroblobs might be faster, but that didn't seem to do much either. Smaller "blocks" of writes didn't seem to change much.
I noticed something as I was playing around, from some timings I stuck into the code for debugging. For a pre-allocation of 50g, the actual time to open the db, write the cruft, drop the table, and close the db is about 3 minutes (181 seconds). Where the majority of that time goes is probably a clue. Initial guess: the writes have not yet internally completed whatever is going on (fsync-ing? indexing kruft?), so the drop command has to wait for their completion? Not sure. One experimental idea I had was to create all the kruft, then delete it, then drop the table. Dunno if that would speed things up.

But also, my overall timing around the whole operation was about 306 seconds, so something else appears to be chewing up time unrelated to the pre-allocation. I haven't dug into what yet. Dunno if mountains of memory pages are taken up and there's something else going on in there.
Ahhhh ... re-opening the database a second time is a huge part of the slowdown. That adds 2 minutes on top of the pre-allocate.
That also means the current implementation in the PR, which would close the db and re-open it if we hit ENOSPC, would be quite slow. Hmmm, gonna take some thought.
@chu11, I think we should pause this idea. It's starting to sound like more trouble than it's worth.
Well, I was thinking about this last night and this morning. Perhaps this discussion should be more about the pros vs. cons, or the tradeoffs we want to accept. If surviving ENOSPC is something we care about, then there simply has to be a cost. With the current implementation I have, that cost is increased startup time. If we accept smaller ENOSPC-survival odds, we can decrease startup time via a smaller preallocation. We can reduce startup time if we trade off job throughput (i.e. default from journal_mode=WAL to OFF). If we implement one or several of the brainstorming ideas I listed above (e.g. preallocate in the background), we trade off ENOSPC-survival odds (i.e. we could hit ENOSPC while we are still preallocating). Etc. etc.

In my mind, an extra 5 mins of startup time doesn't seem like a deal breaker for preallocating 50g. I dunno ... is it?

(As an aside, I just thought of a new idea for my brainstorm list above: instead of preallocating a fixed amount of space, preallocate an amount of space larger than the current db size, i.e. "10g more" than whatever the current db size is. This could even be done in rc1 by checking the old db size and doing some math.)

Edit: I just saw you commented in the PR; I didn't see that till after I wrote this.
Problem: when the disk containing the sqlite database fills up on rank 0, flux does not handle it well, e.g.
@trws suggested providing an option to pre-allocate space in the database.
One approach is to write a huge blob to the db and remove it:
https://stackoverflow.com/questions/827010/sqlite-pre-allocating-database-size
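A minimal illustration of that trick (table name and size are arbitrary): insert one large zero-filled blob, then drop the table so the pages go onto sqlite's freelist and can be reused later.

```c
/* Illustrative SQL for the preallocate-and-drop trick; could be run with
 * sqlite3_exec() or the sqlite3 command line shell. */
static const char *preallocate_sql =
    "CREATE TABLE spacer (b BLOB);"
    "INSERT INTO spacer VALUES (zeroblob(100 * 1024 * 1024));"  /* ~100M */
    "DROP TABLE spacer;";
```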
Some steps may also need to be taken to ensure writes to `content.sqlite-wal` do not fail. More study required on that one: https://www.sqlite.org/wal.html