rgw/sfs: honor retry_raced_bucket_write mechanism #240
base: s3gw
Conversation
Force-pushed from 99d28ed to 8e46b3a
@jecluis I need some help properly configuring the formatter in VS Code. It seems correct ... but I'm struggling a bit to understand why it keeps formatting things not quite the right way 😓
Alright. Ping me tomorrow and we can set up a call to figure out what's going on.
Force-pushed from 8e46b3a to ce2bf84
LGTM, but I'm not so familiar with this part of the code, so it could do with approval from someone else.
src/rgw/driver/sfs/bucket.cc
Outdated
@@ -534,11 +534,39 @@ int SFSBucket::abort_multiparts(
  return sfs::SFSMultipartUploadV2::abort_multiparts(dpp, store, this);
}

/**
 * @brief Refresh this bucket's rep with the state obtained from the store.
What does 'rep' mean?
This is something from my university days; apparently in Pisa all the professors referred to an object's representation as its "rep".
An object's rep refers generically to how an object type organizes its state. The same interface can normally be backed by different reps.
E.g. a Map could be implemented with a Tree rep or with a Hash rep, depending on which workload makes one implementation a better fit.
I've always liked that term, so I tend to use it; I don't know how standard it is, though :).
The only references I can find for "object representation" are about how bytes are organized (see reference). I don't think that matches the way you are using the term, so maybe just drop it and replace it with something less confusing?
Agreed, this term can definitely be confusing.
What about simply using:
Refresh this bucket object with the state obtained from the store.
The state of this bucket can indeed become obsolete when concurrent threads
update metadata using their own SFSBucket instance.
// maybe excessively safe approach: instead of directly assign the returned
// value to this.bucket let's check first if bref is valid.
if (!bref) {
  lsfs_dout(dpp, 0) << fmt::format("no such bucket! {}", get_name()) << dendl;
There might have been a concurrent bucket delete + GC, right? That would mean this isn't 'excessively safe', just the normal way of doing business. -ERR_NO_SUCH_BUCKET sounds sensible to me; logging an error does not (because this is normal behavior). I suggest removing the message or making it debug and improving it. There isn't much to debug with just 'no such bucket! NAME' in hand.
I agree that there could be a concurrent delete; I will remove the comment.
I think the "error" log message still deserves to exist, maybe lowered to warning level. It could help spot a bug in a client's application logic (a client issuing a concurrent delete + metadata update on the same bucket smells a bit).
How would you improve the message anyway (which fields should it dump)?
From our perspective it is hard to tell whether this is a client issue or normal behavior. There might be concurrent clients doing this operation. It is not our job to log warnings for legal but possibly weird client behavior. If someone wants to debug this, there is the operation log feature, and we already log that the bucket doesn't exist via the standard request log that includes the error return. A "no such bucket" warning is redundant at best.
ok, will remove that.
src/rgw/driver/sfs/bucket.cc
Outdated
}
bucket = bref;

// update views like we do in the constructor
If this is like what we do in the constructor, this screams refactoring: we should not duplicate code that updates the in-memory view.
agree! will improve
auto ceph_context = std::make_shared<CephContext>(CEPH_ENTITY_TYPE_CLIENT);
ceph_context->_conf.set_val("rgw_sfs_data_path", getTestDir());
ceph_context->_log->start();
auto store = new rgw::sal::SFStore(ceph_context.get(), getTestDir());
test dir cleanup code missing
will fix
Apologies, what do you mean by that?
The code above creates a new store at getTestDir, but never deletes it. See the other unit tests' teardown hooks; there is usually a delete in there.
But isn't this function supposed to be called for every test: void TearDown() override? Am I missing something here?
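(A minimal sketch of the kind of teardown cleanup being discussed, assuming a hypothetical gtest fixture with a getTestDir() helper; the actual sfs test fixtures may differ.)

#include <filesystem>
#include <string>

#include <gtest/gtest.h>

class TestSFSBucketRefresh : public ::testing::Test {
 protected:
  // Hypothetical per-test directory helper, mirroring getTestDir() in the PR.
  std::string getTestDir() const { return test_dir_; }

  void TearDown() override {
    // Remove everything the SFStore created under the test directory so each
    // test starts from a clean slate (the "cleanup code" being asked for).
    std::filesystem::remove_all(getTestDir());
  }

 private:
  std::string test_dir_ =
      (std::filesystem::temp_directory_path() / "sfs_bucket_test").string();
};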
// obj lock structure in memory cannot be equal at this point in memory for the
// two b1 and b2 objects; this simulates two threads updating metadata over the
Why can't they be equal?
By test construction they cannot be equal: b1's memory is modified with merge_and_store_attrs() and b2's memory is modified with put_info().
Maybe it would be better to replace the b1 and b2 aliases in the comment with their actual names (bucket_from_store_1 and bucket_from_store_2)?
I think it is worth adding a bit more of an explanation in this case; what you just replied in this thread, for instance.
 * @brief Refresh this bucket's rep with the state obtained from the store.
 * Indeed it can happen that the rep of this bucket is obsolete due to
 * concurrent threads updating metadata using their own SFSBucket instance.
 */
int SFSBucket::try_refresh_info(
This might be only half the solution. I didn't dig too deep, but I hope you know more :)
If I inline and simplify the logic a bit we have:
int retry_raced_bucket_write(const DoutPrefixProvider* dpp, rgw::sal::Bucket* b, const F& f) {
  auto r = f();
  for (auto i = 0u; i < 15u && r == -ECANCELED; ++i) {
    r = b->try_refresh_info(dpp, nullptr);
    if (r >= 0) {
      // body of f() inlined here (the put-tags caller, for illustration):
      rgw::sal::Attrs attrs = s->bucket->get_attrs();
      attrs[RGW_ATTR_TAGS] = tags_bl;
      r = s->bucket->merge_and_store_attrs(this, attrs, y);
    }
  }
  return r;
}
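(For contrast, a sketch of the un-inlined generic form, reconstructed from the snippet above; f() stands for the caller-supplied update closure that gets retried after each refresh.)

template <typename F>
int retry_raced_bucket_write(const DoutPrefixProvider* dpp, rgw::sal::Bucket* b, const F& f) {
  auto r = f();                                   // first attempt, using the cached bucket state
  for (auto i = 0u; i < 15u && r == -ECANCELED; ++i) {
    r = b->try_refresh_info(dpp, nullptr);        // reload the bucket state from the store
    if (r >= 0) {
      r = f();                                    // recompute and retry the update on fresh state
    }
  }
  return r;
}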
What if something goes stale between try_refresh_info and merge_and_store_attrs? Is merge_and_store_attrs responsible for returning -ECANCELED when it detects staleness? Can we detect merge conflicts there?
To tell the truth, I don't think the retry_raced_bucket_write mechanism can solve the problem for all cases; it can only mitigate it.
Imagine 2 physical threads A and B and a bucket with an initial state S0. They both race and run truly in parallel, each on its own CPU core, through the try_refresh_info function; at this point, for both A and B, the in-memory state of the bucket is S0. They both think S0 is the latest state and they happily update the bucket with their new data.
Now, because in any decent implementation an update is done under mutual exclusion, A and B will run that portion of critical code serialized, so in the end either A's state or B's state will prevail over the other.
Is this an error? I don't think so; it is part of the game when you allow a bucket to be modified concurrently by more than 1 physical thread.
At a certain point a thread has to say: "Can I safely assume this state in my cached memory is good enough?"; no matter how far you go checking this, at some point you simply must go on.
This retry_raced_bucket_write mechanism can work well with logical threads mapped onto 1 physical thread. In that case they can always be serialized in some way, and retry_raced_bucket_write then solves something it is correct to solve, because the 2 logical threads run on a single core as if they were a single logical sequence of statements.
This is the reasoning I've gone through myself, trying to figure out the original meaning and purpose of the authors of this code.
Let's keep in mind that, as long as we are working downstream and dealing with parts developed upstream, the actual truth will be hard to clarify completely; we would need the support of the original developers to really understand what something was meant for.
The best we can do now is hope that our guess is good enough :) .
Let's assume we are in s->bucket->merge_and_store_attrs(this, new, y).
The state we have: the database state db, our current copy in the bucket object current, and the update new.
The caller based new on current, which was fetched from db.
We must transition from db to new. Right now we transition from current to new. This is a bug, because current may be stale; that is the case whenever current != db.
If we make the update transaction conditional on current == db, we would transition from db to new in turn. If at the time of the update current != db, we fail with -ECANCELED and let the retry logic compute a fresh update.
Ok, now I think I got what you mean, thanks; let me spell out the limit case so we can discuss from there:
Let's assume both threads A and B reach update() concurrently.
Let's assume that update() is the function that flushes a state to the db and cannot be executed by 2 threads concurrently because it is mutex protected.
thread A:
  try_update_data() -> fetch_from_db() -> db_state_0 -> current_state = db_state_0
  update(current_state + new_A)
  now bucket's state on db is db_state_A

thread B:
  try_update_data() -> fetch_from_db() -> db_state_0 -> current_state = db_state_0
  update(current_state + new_B)
  now bucket's state on db is db_state_B

final state on db: db_state_B
db_state_A != db_state_B
new_A is lost.
To avoid the scenario above, we should implement the check inside the mutex-protected update() function (and it must live there, because the check only holds while the mutex is held).
So update() should be something like:
int update(const State& current, const State& new_delta)
{
  std::lock_guard lock(mutex);
  State db_state = _fetch_db_state();
  if (current != db_state) return -ECANCELED;
  return _update(current + new_delta);
}
A good candidate for the update() function could/should be sfs::get_meta_buckets(get_store().db_conn)->store_bucket(), correct?
I think this needs to be one layer deeper, in the database update. Our Bucket code still keeps a shared copy of the database representation in a map, so this might be a bit more tricky with the database access and the mutex around the map.
In essence I think we need something like 'update bucket set attrs = new attrs where attrs = cached attrs'. If that transaction didn't change anything, bail out and let the SAL layer update and retry.
Unfortunately I don't think we can say 'attrs = cached attrs', because both are Ceph-serialized blobs. A transaction that fetches, deserializes, compares and conditionally rolls back / commits may work.
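(A minimal sketch of that fetch / compare / conditionally-commit transaction, using the plain SQLite C API for illustration; the table and column names (buckets, bucket_id, attrs) and the compare_attrs() helper are hypothetical, not the actual sfs schema or code.)

#include <cerrno>
#include <string>
#include <vector>

#include <sqlite3.h>

// Hypothetical helper: deserialize both attribute blobs and compare them.
bool compare_attrs(const std::vector<char>& a, const std::vector<char>& b);

int conditional_update_attrs(
    sqlite3* db, const std::string& bucket_id,
    const std::vector<char>& cached_attrs, const std::vector<char>& new_attrs) {
  // Run the read and the write in one transaction so no other writer can slip
  // in between the staleness check and the update.
  sqlite3_exec(db, "BEGIN IMMEDIATE", nullptr, nullptr, nullptr);

  sqlite3_stmt* sel = nullptr;
  sqlite3_prepare_v2(
      db, "SELECT attrs FROM buckets WHERE bucket_id = ?1", -1, &sel, nullptr);
  sqlite3_bind_text(sel, 1, bucket_id.c_str(), -1, SQLITE_TRANSIENT);
  if (sqlite3_step(sel) != SQLITE_ROW) {
    // Bucket vanished concurrently (e.g. concurrent delete + GC).
    sqlite3_finalize(sel);
    sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr);
    return -ENOENT;
  }
  const char* blob = static_cast<const char*>(sqlite3_column_blob(sel, 0));
  std::vector<char> db_attrs(blob, blob + sqlite3_column_bytes(sel, 0));
  sqlite3_finalize(sel);

  if (!compare_attrs(db_attrs, cached_attrs)) {
    // Our in-memory copy is stale: bail out with -ECANCELED and let
    // retry_raced_bucket_write() refresh and recompute the update.
    sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr);
    return -ECANCELED;
  }

  sqlite3_stmt* upd = nullptr;
  sqlite3_prepare_v2(
      db, "UPDATE buckets SET attrs = ?1 WHERE bucket_id = ?2", -1, &upd, nullptr);
  sqlite3_bind_blob(upd, 1, new_attrs.data(),
                    static_cast<int>(new_attrs.size()), SQLITE_TRANSIENT);
  sqlite3_bind_text(upd, 2, bucket_id.c_str(), -1, SQLITE_TRANSIENT);
  sqlite3_step(upd);
  sqlite3_finalize(upd);

  sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
  return 0;
}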
I've done a patch in this direction that seemed promising, but in the end I stumbled on an issue that IMO requires comments from upstream, so I've opened: https://tracker.ceph.com/issues/63560
Force-pushed from ce2bf84 to 1846c13
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Force-pushed from 1846c13 to 06ac4ab
Updating a bucket's metadata concurrently from two or more threads is allowed in radosgw.
There is a retry mechanism, retry_raced_bucket_write(), that expects bucket references to fetch the latest data from the persistent store.
The rgw/sfs driver didn't implement try_refresh_info() in its bucket class definition; as a result, two references to the same bucket could lead to partial metadata updates.
Fixes: https://github.com/aquarist-labs/s3gw/issues/637
Checklist