rgw/sfs: honor retry_raced_bucket_write mechanism #240
base: s3gw
Conversation
Force-pushed from 99d28ed to 8e46b3a
@jecluis I need some help properly configuring the formatter in VS Code. It seems correct ... but I'm struggling a bit to understand why it keeps formatting things not quite the right way 😓
Alright. Ping me tomorrow and we can set up a call to figure out what's going on.
Force-pushed from 8e46b3a to ce2bf84
LGTM, but I'm not so familiar with this part of the code, so it could do with approval from someone else.
src/rgw/driver/sfs/bucket.cc
Outdated
@@ -534,11 +534,39 @@ int SFSBucket::abort_multiparts(
  return sfs::SFSMultipartUploadV2::abort_multiparts(dpp, store, this);
}

/**
 * @brief Refresh this bucket's rep with the state obtained from the store.
What does 'rep' mean?
This is something from my university days; apparently in Pisa all the professors referred to an object's representation as its "rep".
An object's rep refers generically to how an object type organizes its state. The same interface can normally be backed by different reps.
E.g. a Map could be implemented with a Tree rep or with a Hash rep, depending on which workload makes one implementation a better fit.
I've always liked that term, so I tend to use it; I don't know how standard it is, though :).
The only references I can find for "object representation" are about how bytes are organized (see reference). I don't think that matches the way you are using the term, so maybe just drop it and replace it with something less confusing?
Agreed, this term can definitely be confusing.
What about simply using:
Refresh this bucket object with the state obtained from the store.
The state of this bucket can indeed become obsolete when concurrent threads
update metadata using their own SFSBucket instance.
// maybe excessively safe approach: instead of directly assign the returned
// value to this.bucket let's check first if bref is valid.
if (!bref) {
  lsfs_dout(dpp, 0) << fmt::format("no such bucket! {}", get_name()) << dendl;
There might have been a concurrent bucket delete + GC, right? That would mean this isn't 'excessively safe', just the normal way of doing business. -ERR_NO_SUCH_BUCKET sounds sensible to me; logging an error does not (because this is normal behavior). I suggest removing the message or making it debug and improving it. There isn't much to debug with just 'no such bucket! NAME' in hand.
I agree that there could be a concurrent delete; I will remove the comment.
I think the "error" log message still deserves to exist, maybe lowered to warning level. It could help spot a bug in a client's application logic (a client issuing a concurrent delete + metadata update on the same bucket smells a bit).
How would you improve the message anyway (which fields should it dump)?
From our perspective it is hard to tell whether this is a client issue or normal behavior. There might be concurrent clients doing this operation. It is not our job to log warnings for legal but possibly weird client behavior. If someone wants to debug this, there is the operation log feature, and we already log that the bucket doesn't exist via the standard request log that includes the error return. A "no such bucket" warning is redundant at best.
ok, will remove that.
src/rgw/driver/sfs/bucket.cc
Outdated
}
bucket = bref;

// update views like we do in the constructor
If this is like what we do in the constructor, this screams refactoring: we should not duplicate code that updates the in-memory view.
agree! will improve
auto ceph_context = std::make_shared<CephContext>(CEPH_ENTITY_TYPE_CLIENT);
ceph_context->_conf.set_val("rgw_sfs_data_path", getTestDir());
ceph_context->_log->start();
auto store = new rgw::sal::SFStore(ceph_context.get(), getTestDir());
test dir cleanup code missing
will fix
Apologies, what do you mean by that?
The code above creates a new store at getTestDir, but never deletes it. See the other unit tests' teardown hooks; there is usually a delete in there.
But isn't this function supposed to be called for every test: void TearDown() override? Am I missing something here?
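(A minimal sketch of the kind of teardown cleanup being discussed, assuming a hypothetical gtest fixture with a getTestDir() helper; the actual sfs test fixtures may differ.)

#include <filesystem>
#include <string>

#include <gtest/gtest.h>

class TestSFSBucketRefresh : public ::testing::Test {
 protected:
  // Hypothetical per-test directory helper, mirroring getTestDir() in the PR.
  std::string getTestDir() const { return test_dir_; }

  void TearDown() override {
    // Remove everything the SFStore created under the test directory so each
    // test starts from a clean slate (the "cleanup code" being asked for).
    std::filesystem::remove_all(getTestDir());
  }

 private:
  std::string test_dir_ =
      (std::filesystem::temp_directory_path() / "sfs_bucket_test").string();
};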
// obj lock structure in memory cannot be equal at this point in memory for the
// two b1 and b2 objects; this simulates two threads updating metadata over the
Why can't they be equal?
By test construction they cannot be equal: b1's memory is modified with merge_and_store_attrs() and b2's memory is modified with put_info().
Maybe it would be better to replace the b1 and b2 aliases in the comment with their actual names (bucket_from_store_1 and bucket_from_store_2)?
I think it is worth adding a bit more of an explanation in this case; what you just replied in this thread, for instance.
 * @brief Refresh this bucket's rep with the state obtained from the store.
 * Indeed it can happen that the rep of this bucket is obsolete due to
 * concurrent threads updating metadata using their own SFSBucket instance.
 */
int SFSBucket::try_refresh_info(
This might be only half the solution. I didn't dig too deep, but I hope you know more :)
If I inline and simplify the logic a bit we have:
int retry_raced_bucket_write(const DoutPrefixProvider* dpp, rgw::sal::Bucket* b, const F& f) {
  auto r = f();
  for (auto i = 0u; i < 15u && r == -ECANCELED; ++i) {
    r = b->try_refresh_info(dpp, nullptr);
    if (r >= 0) {
      // body of f() inlined here (the put-tags caller, for illustration):
      rgw::sal::Attrs attrs = s->bucket->get_attrs();
      attrs[RGW_ATTR_TAGS] = tags_bl;
      r = s->bucket->merge_and_store_attrs(this, attrs, y);
    }
  }
  return r;
}
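(For contrast, a sketch of the un-inlined generic form, reconstructed from the snippet above; f() stands for the caller-supplied update closure that gets retried after each refresh.)

template <typename F>
int retry_raced_bucket_write(const DoutPrefixProvider* dpp, rgw::sal::Bucket* b, const F& f) {
  auto r = f();                                   // first attempt, using the cached bucket state
  for (auto i = 0u; i < 15u && r == -ECANCELED; ++i) {
    r = b->try_refresh_info(dpp, nullptr);        // reload the bucket state from the store
    if (r >= 0) {
      r = f();                                    // recompute and retry the update on fresh state
    }
  }
  return r;
}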
What if something goes stale between try_refresh_info and merge_and_store_attrs? Is merge_and_store_attrs responsible for returning -ECANCELED when it detects staleness? Can we detect merge conflicts there?
To tell the truth, I don't think the retry_raced_bucket_write mechanism can solve the problem for all cases; it can only mitigate it.
Imagine 2 physical threads A and B and a bucket with an initial state S0. They both race and run truly in parallel, each on its own CPU core, through the try_refresh_info function; at this point, for both A and B, the in-memory state of the bucket is S0. They both think S0 is the latest state and they happily update the bucket with their new data.
Now, because in any decent implementation an update is done under mutual exclusion, A and B will run that portion of critical code serialized, so in the end either A's state or B's state will prevail over the other.
Is this an error? I don't think so; it is part of the game when you allow a bucket to be modified concurrently by more than 1 physical thread.
At a certain point a thread has to say: "Can I safely assume this state in my cached memory is good enough?"; no matter how far you go checking this, at some point you simply must go on.
This retry_raced_bucket_write mechanism can work well with logical threads mapped onto 1 physical thread. In that case they can always be serialized in some way, and retry_raced_bucket_write then solves something it is correct to solve, because the 2 logical threads run on a single core as if they were a single logical sequence of statements.
This is the reasoning I've gone through myself, trying to figure out the original meaning and purpose of the authors of this code.
Let's keep in mind that, as long as we are working downstream and dealing with parts developed upstream, the actual truth will be hard to clarify completely; we would need the support of the original developers to really understand what something was meant for.
The best we can do now is hope that our guess is good enough :) .
Let's assume we are in s->bucket->merge_and_store_attrs(this, new, y).
The state we have: the database state db, our current copy in the bucket object current, and the update new.
The caller based new on current, which was fetched from db.
We must transition from db to new. Right now we transition from current to new. This is a bug, because current may be stale; that is the case whenever current != db.
If we make the update transaction conditional on current == db, we would transition from db to new in turn. If at the time of the update current != db, we fail with -ECANCELED and let the retry logic compute a fresh update.
Ok, now I think I got what you mean, thanks; let me spell out the limit case so we can discuss from there:
Let's assume both threads A and B reach update() concurrently.
Let's assume that update() is the function that flushes a state to the db and cannot be executed by 2 threads concurrently because it is mutex protected.
thread A:
  try_update_data() -> fetch_from_db() -> db_state_0 -> current_state = db_state_0
  update(current_state + new_A)
  now bucket's state on db is db_state_A

thread B:
  try_update_data() -> fetch_from_db() -> db_state_0 -> current_state = db_state_0
  update(current_state + new_B)
  now bucket's state on db is db_state_B

final state on db: db_state_B
db_state_A != db_state_B
new_A is lost.
To avoid the scenario above, we should implement the check inside the mutex-protected update() function (and it must live there, because the check only holds while the mutex is held).
So update() should be something like:
int update(const State& current, const State& new_delta)
{
  std::lock_guard lock(mutex);
  State db_state = _fetch_db_state();
  if (current != db_state) return -ECANCELED;
  return _update(current + new_delta);
}
A good candidate for the update() function could/should be sfs::get_meta_buckets(get_store().db_conn)->store_bucket(), correct?
I think this needs to be one layer deeper, in the database update. Our Bucket code still keeps a shared copy of the database representation in a map, so this might be a bit more tricky with the database access and the mutex around the map.
In essence I think we need something like 'update bucket set attrs = new attrs where attrs = cached attrs'. If that transaction didn't change anything, bail out and let the SAL layer update and retry.
Unfortunately I don't think we can say 'attrs = cached attrs', because both are Ceph-serialized blobs. A transaction that fetches, deserializes, compares and conditionally rolls back / commits may work.
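(A minimal sketch of that fetch / compare / conditionally-commit transaction, using the plain SQLite C API for illustration; the table and column names (buckets, bucket_id, attrs) and the compare_attrs() helper are hypothetical, not the actual sfs schema or code.)

#include <cerrno>
#include <string>
#include <vector>

#include <sqlite3.h>

// Hypothetical helper: deserialize both attribute blobs and compare them.
bool compare_attrs(const std::vector<char>& a, const std::vector<char>& b);

int conditional_update_attrs(
    sqlite3* db, const std::string& bucket_id,
    const std::vector<char>& cached_attrs, const std::vector<char>& new_attrs) {
  // Run the read and the write in one transaction so no other writer can slip
  // in between the staleness check and the update.
  sqlite3_exec(db, "BEGIN IMMEDIATE", nullptr, nullptr, nullptr);

  sqlite3_stmt* sel = nullptr;
  sqlite3_prepare_v2(
      db, "SELECT attrs FROM buckets WHERE bucket_id = ?1", -1, &sel, nullptr);
  sqlite3_bind_text(sel, 1, bucket_id.c_str(), -1, SQLITE_TRANSIENT);
  if (sqlite3_step(sel) != SQLITE_ROW) {
    // Bucket vanished concurrently (e.g. concurrent delete + GC).
    sqlite3_finalize(sel);
    sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr);
    return -ENOENT;
  }
  const char* blob = static_cast<const char*>(sqlite3_column_blob(sel, 0));
  std::vector<char> db_attrs(blob, blob + sqlite3_column_bytes(sel, 0));
  sqlite3_finalize(sel);

  if (!compare_attrs(db_attrs, cached_attrs)) {
    // Our in-memory copy is stale: bail out with -ECANCELED and let
    // retry_raced_bucket_write() refresh and recompute the update.
    sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr);
    return -ECANCELED;
  }

  sqlite3_stmt* upd = nullptr;
  sqlite3_prepare_v2(
      db, "UPDATE buckets SET attrs = ?1 WHERE bucket_id = ?2", -1, &upd, nullptr);
  sqlite3_bind_blob(upd, 1, new_attrs.data(),
                    static_cast<int>(new_attrs.size()), SQLITE_TRANSIENT);
  sqlite3_bind_text(upd, 2, bucket_id.c_str(), -1, SQLITE_TRANSIENT);
  sqlite3_step(upd);
  sqlite3_finalize(upd);

  sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
  return 0;
}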
I've done a patch in this direction that seemed promising, but in the end I stumbled on an issue that IMO requires comments from upstream, so I've opened: https://tracker.ceph.com/issues/63560
Force-pushed from ce2bf84 to 1846c13
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Force-pushed from 1846c13 to 06ac4ab
Updating a bucket's metadata concurrently from two or more threads is allowed in radosgw.
There is a retry mechanism, retry_raced_bucket_write(), that expects bucket references to fetch the latest data from the persistent store.
The rgw/sfs driver didn't implement try_refresh_info() in its bucket class definition; as a result, two references to the same bucket could lead to partial metadata updates.
Fixes: https://github.com/aquarist-labs/s3gw/issues/637
Checklist