-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Safely acquire lock on multiple tables #100
base: master
Are you sure you want to change the base?
Conversation
# we have already acquired the lock so this check is unnecessary. | ||
# In fact, it could actually cause a deadlock if a blocking query | ||
# was executed shortly after the initial lock acquisition. | ||
break if nested_target_tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the bug I was referring to in the PR description
# so we need to check for blocking queries on those tables as well | ||
target_tables_for_blocking_transactions = target_tables.flat_map do |target_table| | ||
target_table.partitions(include_sub_partitions: true, include_self: true) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this here so that we get a fresh list of tables for every iteration of the loop. I think this gives us extra safety if new partitions are added while this method is executing.
begin | ||
method(adjust_timeout_method).call(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do | ||
connection.execute("LOCK #{target_table.fully_qualified_name} IN #{target_table.mode.to_sql} MODE;") | ||
adjust_statement_timeout(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lock timeout applies to each individual relation in the query, while statement timeout applies to the entire query. So, if we were to use lock timeout here, the upper limit for the query timeout would be LOCK_TIMEOUT_SECONDS * <number of tables being locked>
which seems incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a comment to the code explaining why we don't use the (seemingly) obvious lock_timeout
GUC.
# where the most recent method call is the last element. | ||
def safely_acquire_lock_for_table_history | ||
@safely_acquire_lock_for_table_history ||= [] | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems way cleaner than using a thread variable. Not sure why I did that in the first place...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably the argument is that you don't have to worry about how the method is invoked.
But maybe that means we should actually query Postgres for what locks we already currently hold instead of trying to track them ourselves?
9541a0a
to
c9fdfaa
Compare
4142ddb
to
2cfa43d
Compare
d159713
to
0660b33
Compare
# we have already acquired the lock so this check is unnecessary. | ||
# In fact, it could actually cause a deadlock if a blocking query | ||
# was executed shortly after the initial lock acquisition. | ||
break if nested_target_tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to run the loop at all in this case? I.e., why break out of the loop when we could make the whole loop conditional?
@@ -30,8 +30,12 @@ def present? | |||
name.present? && schema.present? | |||
end | |||
|
|||
def ==(other) | |||
other.is_a?(Relation) && name == other.name && schema == other.schema | |||
def eql?(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this isn't new per se, but I'm wondering again why we don't include mode
in equality/hashing (I understand why we include it in a special way in conflicts_with?
).
@@ -152,4 +156,26 @@ def valid? | |||
SQL | |||
end | |||
end | |||
|
|||
class TableCollection < Set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inheriting from Ruby collection classes is usually considered pretty dangerous now (e.g., lots of discussions about some old ActiveSupport classes) because it's easy to end up with an inconsistent API or internally inconsistent data.
For example, I could construct an instance of this class and then add another table to the set and break the assumptions of the #mode
method (namely that all modes are the same).
I think it would probably preferable to just maintain and internal ivar of the set instance and present a very limited external API.
begin | ||
method(adjust_timeout_method).call(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do | ||
connection.execute("LOCK #{target_table.fully_qualified_name} IN #{target_table.mode.to_sql} MODE;") | ||
adjust_statement_timeout(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a comment to the code explaining why we don't use the (seemingly) obvious lock_timeout
GUC.
# where the most recent method call is the last element. | ||
def safely_acquire_lock_for_table_history | ||
@safely_acquire_lock_for_table_history ||= [] | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably the argument is that you don't have to worry about how the method is invoked.
But maybe that means we should actually query Postgres for what locks we already currently hold instead of trying to track them ourselves?
# The order of the array represents the current call stack, | ||
# where the most recent method call is the last element. | ||
def safely_acquire_lock_for_table_history | ||
@safely_acquire_lock_for_table_history ||= [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we explicitly want to allow nested lock acquisition? If I'm reading the source directly, we didn't explicitly support nested lock acquisition before per se, though we allowed it (as long as it wasn't for the same table).
The nested checks were introduced as a result of #39 to ensure that we didn't upgrade locks, but that was focused on preventing one kind of bug rather than explicitly allowing nested locks. AFAICT it was only incidentally allowed before that.
I can't remember if we need that support for e.g. partitions. But without good justification I'm feeling a bit squeamish about supporting it: it's very easy to get into a bad situation here for multiple reasons:
- Nested calls have the same problems vis-a-vis lock timeouts as allowing transactional DDL. You end up with nested timeouts, as well, which breaks our guarantees around timing.
- It's easy to order calls in different ways (even with respect to SQL/DML that's executing externally to the migrations!) that could cause deadlocks.
I'm wondering if we should focus instead on safer APIs (like the multiple table locking one here) that target specific use cases directly.
Note: I'm not sure that the concern about deadlocks (2) is actually prevented by this approach either, because I assume Postgres acquires locks on the relations in order of their presentation here...so maybe that's unavoidable.
This functionality will allow us to improve the safety of operations that take out locks on multiple tables (e.g. adding / removing foreign keys).
I also discovered a bug in the nested lock acquisition logic, so this PR fixes that and adds more tests.
I feel like it's tradition at this point for me to include too many changes in a PR and for @jcoleman to ask me to break it up into smaller commits 😆 (
I will try to do that tomorrowdone).