Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admin fate improvements, LockIDs use for fate stores improved/fixed #5028

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

kevinrr888
Copy link
Member

This PR makes several improvements/fixes: improvements to the admin fate util which were made possible from #4524, replaced incorrect use of createDummyLockID() in real code (now only used in tests), <User/Meta>FateStore now support a null lock id if they will be used as read-only stores: write ops will fail on a store with a null lock, and some other misc. changes.

Full list of changes:

  • Removed the check for a dead Manager in the Admin fate util (AdminUtil) which was checked before admin fate delete <tx> or admin fate fail <tx> was able to run. This check is no longer needed with the changes from Fate reservations moved out of memory #4524. Fate reservations moved out of memory #4524 moved reservations out of Manager memory into the FATE table (for UserFateStore) and into ZK (for MetaFateStore). Prior to this, the Admin process would have no way of knowing if the Manager process had a transaction reserved, so the Manager had to be shutdown to ensure it was not. But now that reservations are visible to any process, we can try to reserve the transaction in Admin, and if it cannot be reserved and deleted/failed in a reasonable time, we let the user know that the Manager would need to be shutdown if deleting/failing the transaction is still desired.
    • This has several benefits:
    • It is one less thing to worry about when implementing multiple managers in the future since Admin assumes only one Manager for these commands. However, there is still the case where the Manager may
      keep a transaction reserved for a long period of time and the Admin can never reserve it. In this case, we inform the user that the transaction could not be deleted/failed and that if deleting/failing
      is still desired, the Manager may need to be shutdown.
    • It covers a potential issue in the previously existing code where there was nothing stopping or ensuring a Manager is not started after the check is already performed in Admin but before the delete/
      fail was executed.
    • It also should make the commands easier to use now since the Manager is not required to be shutdown before use.
  • Changes and adds some tests for admin fate fail and admin fate delete: ensures the Manager is not required to be down to fail/delete a transaction, and ensures that if the Manager does have a transaction reserved, admin will fail to reserve and fail/delete the transaction.
  • Another change which was needed as a prerequisite for the above changes was creating a ZK lock for Admin so transactions can be properly reserved by the command. Added new constant Constants.ZADMIN_LOCK = "/admin/lock", changed ServiceLockPaths, and added Admin.createAdminLock() to support this
  • New class TestLock (in test package) which is used by tests to create a real ZK lock, or a dummy one. Removed createDummyLockID() from AbstractFateStore (moved to TestLock), and createDummyLock() is now only used in test code. Added new constant ZTEST_LOCK = "/test/lock", changed ServiceLockPaths, and added createTestLock() which is used to create a real lock id (one held in ZK) which is needed for some tests.
    • This fixes an unexpected failure that could have occurred for ExternalCompaction_1_IT. Was using a dummy lock for the store before and the fate data was being stored in the same locations that the
      Manager uses for it's fate data. The DeadReservationCleaner running in Manager would have cleaned up reservations created using this store if it ran when reservations were present. Now the test creates
      a real ZK lock so the DeadReservationCleaner won't clean these up unexpectedly.
  • Stores now support a null lock id for the situation where they will be used as read-only stores. A store with a null lock id will fail on write ops. Changed all existing uses of stores to only have a lock id if writes will occur (previously, all instances of the stores had a lock id).
  • Removed unused or unneccesary constructors for AbstractFateStore, MetaFateStore, UserFateStore
  • Ensured all tests changed, all FATE tests, and sunny day tests still pass

closes #4904

This commit makes several improvements/fixes: improvements to the `admin fate` util which were made possible from apache#4524, replaced incorrect use of `createDummyLockID()` in real code (now only used in tests), `<User/Meta>FateStore` now support a null lock id if they will be used as read-only stores: write ops will fail on a store with a null lock, and some other misc. changes.

Full list of changes:
- Removed the check for a dead Manager in the Admin fate util (AdminUtil) which was checked before `admin fate delete <tx>` or `admin fate fail <tx>` was able to run. This check is no longer needed with the changes from apache#4524. apache#4524 moved reservations out of Manager memory into the FATE table (for UserFateStore) and into ZK (for MetaFateStore). Prior to this, the Admin process would have no way of knowing if the Manager process had a transaction reserved, so the Manager had to be shutdown to ensure it was not. But now that reservations are visible to any process, we can try to reserve the transaction in Admin, and if it cannot be reserved and deleted/failed in a reasonable time, we let the user know that the Manager would need to be shutdown if deleting/failing the transaction is still desired.
	- This has several benefits:
	- It is one less thing to worry about when implementing multiple managers in the future since Admin assumes only one Manager for these commands. However, there is still the case where the Manager may
	keep a transaction reserved for a long period of time and the Admin can never reserve it. In this case, we inform the user that the transaction could not be deleted/failed and that if deleting/failing
	is still desired, the Manager may need to be shutdown.
	- It covers a potential issue in the previously existing code where there was nothing stopping or ensuring a Manager is not started after the check is already performed in Admin but before the delete/
	fail was executed.
	- It also should make the commands easier to use now since the Manager is not required to be shutdown before use.
- Changes and adds some tests for `admin fate fail` and `admin fate delete`: ensures the Manager is not required to be down to fail/delete a transaction, and ensures that if the Manager does have a transaction reserved, admin will fail to reserve and fail/delete the transaction.
- Another change which was needed as a prerequisite for the above changes was creating a ZK lock for Admin so transactions can be properly reserved by the command. Added new constant `Constants.ZADMIN_LOCK = "/admin/lock"`, changed `ServiceLockPaths`, and added `Admin.createAdminLock()` to support this
- New class `TestLock` (in test package) which is used by tests to create a real ZK lock, or a dummy one. Removed `createDummyLockID()` from `AbstractFateStore` (moved to TestLock), and `createDummyLock()` is now only used in test code. Added new constant `ZTEST_LOCK = "/test/lock"`, changed `ServiceLockPaths`, and added `createTestLock()` which is used to create a real lock id (one held in ZK) which is needed for some tests.
	- This fixes an unexpected failure that could have occurred for `ExternalCompaction_1_IT`. Was using a dummy lock for the store before and the fate data was being stored in the same locations that the
	Manager uses for it's fate data. The DeadReservationCleaner running in Manager would have cleaned up reservations created using this store if it ran when reservations were present. Now the test creates
	a real ZK lock so the DeadReservationCleaner won't clean these up unexpectedly.
- Stores now support a null lock id for the situation where they will be used as read-only stores. A store with a null lock id will fail on write ops. Changed all existing uses of stores to only have a lock id if writes will occur (previously, all instances of the stores had a lock id).
- Removed unused or unneccesary constructors for AbstractFateStore, MetaFateStore, UserFateStore
- Ensured all tests changed, all FATE tests, and sunny day tests still pass

closes apache#4904
@kevinrr888 kevinrr888 self-assigned this Oct 31, 2024
@kevinrr888 kevinrr888 added this to the 4.0.0 milestone Oct 31, 2024
AdminLockWatcher lw = new AdminLockWatcher();
ServiceLockData.ServiceDescriptors descriptors = new ServiceLockData.ServiceDescriptors();
descriptors.addService(new ServiceLockData.ServiceDescriptor(uuid,
ServiceLockData.ThriftService.NONE, "localhost", Constants.DEFAULT_RESOURCE_GROUP_NAME));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about "localhost" here or if it even matters

Copy link
Contributor

@keith-turner keith-turner Nov 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use something that is more distinct than localhost like fake_admin_util_host.

TestLockWatcher lw = new TestLockWatcher();
ServiceLockData.ServiceDescriptors descriptors = new ServiceLockData.ServiceDescriptors();
descriptors.addService(new ServiceLockData.ServiceDescriptor(uuid,
ServiceLockData.ThriftService.NONE, "localhost", Constants.DEFAULT_RESOURCE_GROUP_NAME));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same "localhost" comment

* stores. Can always use {@link TestLock#createTestLock(ServerContext)} to be safe if unsure
* which to use.
*/
public static ZooUtil.LockID createDummyLockID() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't technically needed anymore: could always just use createTestLock() instead, but left it since it's quicker than creating a ZK lock when that isn't necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving its seems ok to me since its in the test module.

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good, made some minor comments but did not see problems.

@@ -91,6 +91,8 @@ public class Constants {

public static final String ZTABLE_LOCKS = "/table_locks";
public static final String ZMINI_LOCK = "/mini";
public static final String ZADMIN_LOCK = "/admin/lock";
public static final String ZTEST_LOCK = "/test/lock";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to drop this constant and have the test code create locks under the admin path when needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't possible with how FateOpsCommandsIT uses locks. It uses a test lock (for the test process) and also an Admin lock (for the admin process).

@@ -275,6 +269,10 @@ protected void verifyFateKey(FateId fateId, Optional<FateKey> fateKeySeen,
"Collision detected for fate id " + fateId);
}

protected void verifyLock(ZooUtil.LockID lockID, FateId fateId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to ignore this comment, it was something I noticed but not sure about what if any changes to make based on this observation.

In every place where this method is called the next line is something like FateReservation.from(lockID, UUID.randomUUID()). The call to FateReservation.from will fail if the lockId is null. So this method may be redundant, however its error message is more informative because it includes the fate id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is redundant but, like you said, I thought it might be better to have the more detailed error message at the cost of double-checking the lock id for null

Comment on lines 566 to 568
log.error("Could not {} {} in a reasonable time. This indicates the Manager is currently "
+ "working on {}. If {} {} is still desired, the Manager needs to be stopped and "
+ "the command needs to be rerun.", op, fateId, fateId, op, fateId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice error message. Could make it repeat info a bit less.

Suggested change
log.error("Could not {} {} in a reasonable time. This indicates the Manager is currently "
+ "working on {}. If {} {} is still desired, the Manager needs to be stopped and "
+ "the command needs to be rerun.", op, fateId, fateId, op, fateId);
log.error("Could not {} {} in a reasonable time. This indicates the Manager is currently "
+ "working on it. The manager may need to be stopped to complete this command.", op, fateId);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was a bit much. Fixed.

boolean fail;

@Parameter(names = {"-d", "--delete"},
description = "<FateId>... Delete locks associated with transactions (Requires Manager to be down)")
description = "<FateId>... Delete locks associated with transactions")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This help message is not quite correct. This was a preexisting problem though.

Suggested change
description = "<FateId>... Delete locks associated with transactions")
description = "<FateId>... Delete fate transaction and its associated table locks")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch... I missed that. This has been fixed by #5080 in 2.1+

class AdminLockWatcher implements ServiceLock.AccumuloLockWatcher {
@Override
public void lostLock(ServiceLock.LockLossReason reason) {
log.warn("Lost lock: " + reason.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to halt the process after the warning. If we are no longer holding the lock, then any reservations may no longer be valid. Could use org.apache.accumulo.core.util.Halt.

}
if (zk.exists(lockPath, false) == null) {
zk.create(lockPath, new byte[0], ZooUtil.PUBLIC, CreateMode.PERSISTENT);
log.info("Created: {}", lockPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.info("Created: {}", lockPath);
log.info("Created: {} in zookeeper", lockPath);

ServiceLockData.ThriftService.NONE, "localhost", Constants.DEFAULT_RESOURCE_GROUP_NAME));
ServiceLockData sld = new ServiceLockData(descriptors);
String lockPath = slp.toString();
String parentLockPath = lockPath.substring(0, lockPath.indexOf("/lock"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to avoid this literal "/lock" if possible. Maybe could use lockPath.split("/") and work down the nodes in that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Use lastIndexOf("/") now

AdminLockWatcher lw = new AdminLockWatcher();
ServiceLockData.ServiceDescriptors descriptors = new ServiceLockData.ServiceDescriptors();
descriptors.addService(new ServiceLockData.ServiceDescriptor(uuid,
ServiceLockData.ThriftService.NONE, "localhost", Constants.DEFAULT_RESOURCE_GROUP_NAME));
Copy link
Contributor

@keith-turner keith-turner Nov 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use something that is more distinct than localhost like fake_admin_util_host.

*/
protected void stopManager() throws InterruptedException, IOException {
getCluster().getClusterControl().stopAllServers(ServerType.MANAGER);
Thread.sleep(20_000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why sleep here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sleep originated from when this class was initially created. Looking back at our slack messages, found that the purpose of this sleep was to make sure the Manager lock was no longer held after shutdown (shutdown the Manager, wait 20 seconds (2x the Property.INSTANCE_ZK_TIMEOUT value we set)). This made sure when we checked if the Manager lock was held before calling some of our admin fate utils which required the Manager to be down, it would not unexpectedly fail. But we no longer do that check here, so I think you're right and this isn't needed anymore. I forgot the reasoning for this sleep and just assumed it was needed to ensure the Manager was fully shutdown.

@@ -341,8 +341,8 @@ private void testDeadReservationsCleanup(TestStoreFactory<LatchTestEnv> testStor

// Create the new Fate/start the Fate threads (the work finder and the workers).
// Don't run another dead reservation cleaner since we already have one running from fate1.
FastFate<LatchTestEnv> fate2 = new FastFate<>(testEnv2, store2, false, Object::toString,
DefaultConfiguration.getInstance());
Fate<LatchTestEnv> fate2 =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this code needlessly using FastFate before? Its not running the dead reservation cleaner, so it does not need to use FastFate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, was not needed since it isn't running the cleaner. Wasn't wrong before, but is more clear as just Fate

@kevinrr888
Copy link
Member Author

kevinrr888 commented Nov 21, 2024

One thing I noticed when working on this and running the tests again was that in a couple of the new tests added here (testFate<Delete/Fail>CommandTimeout), the test would fail/timeout on fate.shutdown(). There appears to still be one active transaction worker running (Fate <META/USER> is waiting for worker threads to terminate is logged until timeout). This only happens sometimes and I lowered the shutdown timeout to avoid a test failure in this case (this just avoids a test failure and allows the shutdownNow to be called on the pool, but doesn't fix the issue). It seems strange to me that a worker would still be running... there is only one transaction which is even attempted to be reserved (the one we seed in the test) and that is always followed by an unreserve (verified with logs). @keith-turner maybe you have some ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace AbstractFateStore.createDummyLockID()
2 participants