
Safely acquire lock on multiple tables #100

Open: rkrage wants to merge 3 commits into master from multi-table-locks

Conversation

rkrage (Contributor) commented Jun 5, 2024

This functionality will allow us to improve the safety of operations that take out locks on multiple tables (e.g. adding / removing foreign keys).

I also discovered a bug in the nested lock acquisition logic, so this PR fixes that and adds more tests.

I feel like it's tradition at this point for me to include too many changes in a PR and for @jcoleman to ask me to break it up into smaller commits 😆 (I will try to do that tomorrow. Edit: done.)

# we have already acquired the lock so this check is unnecessary.
# In fact, it could actually cause a deadlock if a blocking query
# was executed shortly after the initial lock acquisition.
break if nested_target_tables
rkrage (Contributor, Author):

This is the bug I was referring to in the PR description

# so we need to check for blocking queries on those tables as well
target_tables_for_blocking_transactions = target_tables.flat_map do |target_table|
target_table.partitions(include_sub_partitions: true, include_self: true)
end
rkrage (Contributor, Author):

I moved this here so that we get a fresh list of tables for every iteration of the loop. I think this gives us extra safety if new partitions are added while this method is executing.

begin
method(adjust_timeout_method).call(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do
connection.execute("LOCK #{target_table.fully_qualified_name} IN #{target_table.mode.to_sql} MODE;")
adjust_statement_timeout(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do
rkrage (Contributor, Author):

Lock timeout applies to each individual relation in the query, while statement timeout applies to the entire query. So, if we were to use lock timeout here, the upper limit for the query timeout would be LOCK_TIMEOUT_SECONDS * <number of tables being locked>, which seems incorrect.
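A toy illustration of the difference (the table names and the 5s value are made up, not from this PR):

connection.transaction do
  connection.execute("SET LOCAL lock_timeout = '5s'")
  # lock_timeout is enforced separately for each lock acquisition, so this
  # statement could block for up to ~15s (3 relations x 5s) before failing.
  connection.execute("LOCK TABLE foos, bars, bazes IN ACCESS EXCLUSIVE MODE")
end

connection.transaction do
  connection.execute("SET LOCAL statement_timeout = '5s'")
  # statement_timeout bounds the whole statement, so the same LOCK fails
  # after ~5s total regardless of how many relations it names.
  connection.execute("LOCK TABLE foos, bars, bazes IN ACCESS EXCLUSIVE MODE")
end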

Contributor:

We should add a comment to the code explaining why we don't use the (seemingly) obvious lock_timeout GUC.

# where the most recent method call is the last element.
def safely_acquire_lock_for_table_history
@safely_acquire_lock_for_table_history ||= []
end
rkrage (Contributor, Author):

This seems way cleaner than using a thread variable. Not sure why I did that in the first place...

Contributor:

Probably the argument is that you don't have to worry about how the method is invoked.

But maybe that means we should actually query Postgres for what locks we already currently hold instead of trying to track them ourselves?
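A rough sketch of that idea (not necessarily the exact query we'd want), listing the relation-level locks already granted to the current backend:

# Ask Postgres which table-level locks this session already holds,
# instead of tracking them ourselves in Ruby.
held_locks = connection.select_rows(<<~SQL)
  SELECT ns.nspname, c.relname, l.mode
  FROM pg_locks l
  JOIN pg_class c ON c.oid = l.relation
  JOIN pg_namespace ns ON ns.oid = c.relnamespace
  WHERE l.locktype = 'relation'
    AND l.pid = pg_backend_pid()
    AND l.granted
SQL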

rkrage force-pushed the multi-table-locks branch 3 times, most recently from 9541a0a to c9fdfaa on June 7, 2024 at 14:33
rkrage force-pushed the multi-table-locks branch 8 times, most recently from 4142ddb to 2cfa43d on June 20, 2024 at 12:33
# we have already acquired the lock so this check is unnecessary.
# In fact, it could actually cause a deadlock if a blocking query
# was executed shortly after the initial lock acquisition.
break if nested_target_tables
Contributor:

Does it make sense to run the loop at all in this case? I.e., why break out of the loop when we could make the whole loop conditional?

@@ -30,8 +30,12 @@ def present?
name.present? && schema.present?
end

def ==(other)
other.is_a?(Relation) && name == other.name && schema == other.schema
def eql?(other)
Contributor:

I know this isn't new per se, but I'm wondering again why we don't include mode in equality/hashing (I understand why we include it in a special way in conflicts_with?).

@@ -152,4 +156,26 @@ def valid?
SQL
end
end

class TableCollection < Set
Contributor:

Inheriting from Ruby collection classes is usually considered pretty dangerous now (e.g., lots of discussions about some old ActiveSupport classes) because it's easy to end up with an inconsistent API or internally inconsistent data.

For example, I could construct an instance of this class and then add another table to the set and break the assumptions of the #mode method (namely that all modes are the same).

I think it would probably be preferable to just maintain an internal ivar of the set instance and present a very limited external API.
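Rough sketch of the composition approach (class shape and error message are just illustrative, not a concrete proposal):

require "set"

# Wrap a Set in an ivar and expose only what callers need, so nothing
# outside the class can add a table with a different lock mode and break
# the invariant that #mode relies on.
class TableCollection
  include Enumerable

  def initialize(tables)
    modes = tables.map(&:mode).uniq
    raise ArgumentError, "tables must share a single lock mode" if modes.size > 1

    @tables = Set.new(tables).freeze
  end

  def each(&block)
    @tables.each(&block)
  end

  def mode
    first&.mode
  end
end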

# The order of the array represents the current call stack,
# where the most recent method call is the last element.
def safely_acquire_lock_for_table_history
@safely_acquire_lock_for_table_history ||= []
Contributor:

Do we explicitly want to allow nested lock acquisition? If I'm reading the source correctly, we didn't explicitly support nested lock acquisition before per se, though we allowed it (as long as it wasn't for the same table).

The nested checks were introduced as a result of #39 to ensure that we didn't upgrade locks, but that was focused on preventing one kind of bug rather than explicitly allowing nested locks. AFAICT it was only incidentally allowed before that.

I can't remember if we need that support for e.g. partitions. But without good justification I'm feeling a bit squeamish about supporting it: it's very easy to get into a bad situation here for multiple reasons:

  1. Nested calls have the same problems vis-a-vis lock timeouts as allowing transactional DDL. You end up with nested timeouts, as well, which breaks our guarantees around timing.
  2. It's easy to order calls in different ways (even with respect to SQL/DML that's executing externally to the migrations!) that could cause deadlocks.

I'm wondering if we should focus instead on safer APIs (like the multiple table locking one here) that target specific use cases directly.

Note: I'm not sure that the concern about deadlocks (2) is actually prevented by this approach either, because I assume Postgres acquires locks on the relations in order of their presentation here...so maybe that's unavoidable.
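To make the ordering concern in (2) concrete, a toy interleaving (table names invented) where two sessions acquire the same pair of locks in opposite orders:

Session A: LOCK TABLE foos IN ACCESS EXCLUSIVE MODE;  -- granted
Session B: LOCK TABLE bars IN ACCESS EXCLUSIVE MODE;  -- granted
Session A: LOCK TABLE bars IN ACCESS EXCLUSIVE MODE;  -- waits on B
Session B: LOCK TABLE foos IN ACCESS EXCLUSIVE MODE;  -- waits on A, deadlock detected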
