
Another solution to livelock #14

Open
drmingdrmer opened this issue Oct 21, 2019 · 48 comments

@drmingdrmer

drmingdrmer commented Oct 21, 2019

Update: 2020-03-07:

This approach works only with a set of instances in which every two of them interfere with each other. See @snaury's comments below: #14 (comment)

The livelock problem

EPaxos specifies that in an SCC, the command with the lowest seq should be executed
first.

If there is a stream of interfering commands, the execution thread has to keep walking the ever-growing SCC and cannot execute any command until it has walked through the entire SCC.
This is the livelock.

The solution provided by EPaxos is to prioritize completing old commands over
proposing new commands.

This solution might bring in some latency, because we need to stop
proposing new commands in order to break the command chain.

I've been thinking maybe there is some other way to solve the livelock problem.

We assume that all commands mentioned below interfere with each other.

Ordering

Assume we determine the command order in an SCC by a tuple (seq, replicaID).
Then the first command to execute is the one with the least (seq, replicaID)
in an SCC.

Find the first command safe to execute

Provided with a committed, un-executed, interfering command list
CMDs = {Ra.A, Rb.B, Rc.C, ...} (Ra.A is a command owned by replica Ra).

Assume that Ra.A has the least (seq, replicaID).

Define a function replicas(cmds) that extracts the set of command owner replicas,
e.g. replicas({Ra.A, Rb.B}) = {Ra, Rb}.

Define allDeps, the union of all dependencies of CMDs: allDeps = Ra.A.deps ∪ Rb.B.deps ∪ Rc.C.deps ∪ ...

If CMDs contains only the least-seq commands (i.e., for two un-executed interfering commands Ra.X and Ra.Y:
if Ra.X.seq < Ra.Y.seq and Ra.Y is in CMDs, then Ra.X must also be in CMDs):

  1. For an un-executed command X not in CMDs and owned by one of replicas(CMDs), we have:
    X.seq > Ra.A.seq.

    Because X.seq is greater than the seq of its preceding command on the same
    replica.

    Thus X should never be executed before Ra.A.

  2. For an un-executed command Y not in CMDs and NOT owned by any of replicas(CMDs),
    if replicas(allDeps) ⊆ replicas(CMDs),
    then Y depends on CMDs.

    Because any two committed interfering commands U and V have at least one dependency
    relationship between them, and no command in CMDs depends on a command owned by
    Y's replica (since replicas(allDeps) ⊆ replicas(CMDs)), Y must depend on CMDs.

    Thus Y should never be executed before Ra.A.

With (1) and (2), we have the conclusion
that Ra.A is the command that should be executed first.

The abstract algorithm

Initialize an empty CMDs.
Add the next least (seq, replicaID) command and its deps into CMDs.
Repeat this step until we find a CMDs so that replicas(allDeps) ⊆ replicas(CMDs), then
execute the least command in CMDs.

Thus we can execute commands even before the entire SCC is committed.

Because the number of replicas is finite and small, this process would finish quickly,
even when there are a lot of interfering commands.

In this way, searching for an SCC is no longer required and there won't be a livelock problem anymore.
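
A minimal Go sketch of the loop above, with invented names (Cmd, firstSafe, etc. are not from the EPaxos codebase). It is seeded with the least committed command and keeps pulling in deps until the owner set of all dependencies is covered by the owner set of CMDs; it assumes every command it reaches is already committed, i.e. its deps are known:

package livelock

import "sort"

// Cmd is a committed, un-executed command (hypothetical type).
type Cmd struct {
	Replica int    // owner replica ID
	Seq     int    // EPaxos sequence number
	Deps    []*Cmd // committed dependencies, assumed already known
}

// less orders commands by the (seq, replicaID) tuple.
func less(a, b *Cmd) bool {
	if a.Seq != b.Seq {
		return a.Seq < b.Seq
	}
	return a.Replica < b.Replica
}

// firstSafe grows CMDs from seed (the least committed command seen so far)
// by repeatedly taking the least unprocessed command and adding its deps, and
// stops as soon as replicas(allDeps) ⊆ replicas(CMDs). The least command in
// CMDs is then safe to execute without waiting for the whole SCC.
func firstSafe(seed *Cmd) *Cmd {
	cmds := map[*Cmd]bool{seed: true}
	queue := []*Cmd{seed}

	for len(queue) > 0 {
		// Pop the least (seq, replicaID) unprocessed command and pull in its deps.
		sort.Slice(queue, func(i, j int) bool { return less(queue[i], queue[j]) })
		c := queue[0]
		queue = queue[1:]
		for _, d := range c.Deps {
			if !cmds[d] {
				cmds[d] = true
				queue = append(queue, d)
			}
		}

		// Compute replicas(CMDs), replicas(allDeps) and the least member of CMDs.
		cmdReplicas := map[int]bool{}
		depReplicas := map[int]bool{}
		var lowest *Cmd
		for m := range cmds {
			cmdReplicas[m.Replica] = true
			if lowest == nil || less(m, lowest) {
				lowest = m
			}
			for _, d := range m.Deps {
				depReplicas[d.Replica] = true
			}
		}
		covered := true
		for r := range depReplicas {
			if !cmdReplicas[r] {
				covered = false
				break
			}
		}
		if covered {
			return lowest // safe to execute first
		}
	}
	return nil // unreachable when all reached commands are committed
}

As noted above, the loop stops once the set of owner replicas stops growing, so with a small number of replicas it finishes quickly.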

@imoraru
Member

imoraru commented Oct 22, 2019

Yes, I think that works. Thanks for the writeup.

@snaury

snaury commented Mar 4, 2020

Sorry for asking this, but how do you guarantee you have found the least (seq, replicaID)? If I understand correctly, commands don't have seq finalized until they are committed, which may happen in the order of their seq actually decreasing. So, if a replica observes part of a long (potentially infinite) SCC, with dependencies that are not yet committed, how can you be sure those dependencies won't have seq lower than whatever you have observed so far? For example:

A <------------- B <-------------- C <-- ...
|               ^ |               ^ |
\-> ab1 -> ab2 -/ \-> bc1 -> bc2 -/ \-> ...

I know epaxos technically uses optimized dependencies, but I'm trying to consider a more generic dependency graph, which seems possible to generate with 5 replicas (1 replica proposes A, B, C, ... in such a way that they always conflict pairwise, other replicas propose ab1, ab2, bc1, bc2, ... in such a way that it creates a loop for each pair).

In the above graph, after receiving commits for A, B, ab1 and ab2 (the first loop) is it really possible to know whether or not we have observed the least seq value? It seems like bc1 may end up with smaller or larger seq than these 4 during phase 1.

Could you please explain why (1) would hold in such a case? (e.g. bc1 was proposed by the same replica that proposed ab1 or ab2, such that replicas(allDeps) ⊆ replicas(CMDs) holds, but there is no direct conflict with either, except through B)

I'm probably missing something obvious here, or maybe graph like this is too generic/actually impossible?

@imoraru
Member

imoraru commented Mar 4, 2020

As an aside, I would encourage you not to implement the version of the algorithm that requires "seq". Instead, rely on "deps" alone and ack a command to the client only after the command's dependency graph is finalized (or only after it has been executed). To order commands within an SCC use any order criterion as long as it is consistent across replicas. Without "seq", the algorithm has fewer corner cases and better consistency properties.

@snaury

snaury commented Mar 4, 2020

I'm not sure it would help with solving livelock due to infinitely growing SCC (without stopping new proposes that is), as using seq as a tie breaker may be complicated, but doesn't seem to be the real issue. The problem I'm trying to understand is when we have an incomplete SCC and there are vertices that depend on that SCC, can we even be certain those vertices won't ever become part of it as we receive more vertices of the graph (assuming worst case scenario where commit messages between replicas are severely delayed and arbitrarily reordered), because execution order may reverse depending on it. I'm probably thinking about dependency graphs that are too general though.

Edit: I actually discovered epaxos after reading a recent paper on bpaxos (there was no seq, but no mention of efficient solutions to livelock problem either, which sparked my interest in finding one)

@imoraru
Member

imoraru commented Mar 5, 2020

Well, that's why I called the observation about seq an aside, not a direct answer to your question.
About your question, though. When you talk about ensuring liveness in a real system, solutions usually revolve around better engineering. A simplistic blunt instrument in this case would be to say that replicas stop pre-accepting new commands as long as the dependency graph for their potential dependencies is larger than a certain threshold size. Replicas would thus wait until all of these dependencies finish committing. If implemented poorly, this could significantly reduce your throughput and increase your latency. If implemented well, this would hopefully kick in only so rarely that you wouldn't see its effect in the 99%ile latency.
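
A rough sketch of that blunt instrument, with invented names (nothing here is from the EPaxos codebase): gate Phase 1 on the size of the local graph of not-yet-committed potential dependencies.

package backpressure

// gate holds the local backlog size and the cut-off described above.
type gate struct {
	uncommittedDeps int // commands that could become dependencies but have not committed yet
	threshold       int // purely illustrative cut-off
}

// handlePropose either starts Phase 1 (pre-accept) for a new command or parks
// it until enough of the pending potential dependencies have committed.
func (g *gate) handlePropose(startPhase1, parkForLater func()) {
	if g.uncommittedDeps > g.threshold {
		parkForLater() // stop pre-accepting new commands until the backlog drains
		return
	}
	startPhase1()
}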

That said, I do believe that there could very well be ways to tackle this from a theoretical point of view as well. I haven't thought about this too much, but you could try to reason about which dependencies are really necessary, and which dependencies are superfluous. The critical thing in EPaxos is to guarantee that if two commands interfere with each other, one will be in the other's deps graph. They don't need to be in the same SCC, and as you point out, it's better if they are not. Thus, are there some rules that replicas can use to reason "A must be in deps(B), so I won't add B to deps(A)"? I think so.

@drmingdrmer
Author

Sorry for asking this, but how do you guarantee you have found the least (seq, replicaID)? If I understand correctly, commands don't have seq finalized until they are committed, which may happen in the order of their seq actually decreasing. So, if a replica observes part of a long (potentially infinite) SCC, with dependencies that are not yet committed, how can you be sure those dependencies won't have seq lower than whatever you have observed so far? For example:

Assuming our discussion is only about interfering commands:

I think what confused you is this step:

Add the next least (seq, replicaID) command and its deps into CMDs.

Let me elaborate on it:

Assume the least (seq, replicaID) command the executing replica has seen is X.

For an unseen command Y:

  • If X does not depend on Y, then Y must depend on X, thus Y must have a greater seq.
  • If X depends on Y, then Y will eventually be added into CMDs.

Thus CMDs must contain the least-seq command.

@snaury

snaury commented Mar 5, 2020

If X does not depend on Y, then Y must depend on X, thus Y must have a greater seq.

I'm not sure how to parse this. Are you talking about direct dependencies, or transitive dependencies as well? I assume direct, since in SCC everything depends on everything, in such case here's a counter example:

  • We have 5 replicas, initially replicas 1 and 5 are network partitioned
  • Replicas 2, 3, 4 had a series of commands making 100 updates to Z, such that the last command is Z.100 (which should be read as command writing to Z with seq==100)
  • Replica 1 gets out of being partitioned and receives proposals for 4 commands in rapid succession:
    • Command A is A := Z + D
    • Command B is B := A
    • Command C is C := B
    • Command D is D := C
  • We assume some messages are lost, delayed and reordered

The sequence of events is as follows (number before the colon is the replica number):

1: comes out of a network partition
1: A B C D proposed in rapid succession
   initially A.0 <- B.1 <- C.2 <- D.3 (pairwise dependencies)
   and also A.0 <- D.3 because read/write to D
1: sends PreAccept messages to 2, 3, 4 but they are delayed and reordered

2: receives PreAccept(B.1, [A]), sends PreAcceptOK(B.1, [A]), i.e. there is no A yet, no updates
3: receives PreAccept(B.1, [A]), sends PreAcceptOK(B.1, [A]), i.e. there is no A yet, no updates
1: receives above replies, have a fast quorum for B.1, deps=[A], will commit eventually

3: receives PreAccept(C.2, [B]), sends PreAcceptOK(C.2, [B]), i.e. there is no B yet, no updates

3: receives PreAccept(D.3, [A, C]), sends PreAcceptOK(D.3, [A, C]), i.e. conflict with C.2, no updates
4: receives PreAccept(D.3, [A, C]), sends PreAcceptOK(D.3, [A, C]), i.e. there is no C yet, no updates
1: receives above replies, have a fast quorum for D.3, deps=[A, C], will commit eventually

5: comes out of a network partition
5: receives PreAccept(A.0, []), sends PreAcceptOK(A.0, []), no updates
4: receives PreAccept(A.0, []), sends PreAcceptOK(A.101, [Z, D]), updated on conflicts with Z.100 and D.3
1: receives above replies, have a quorum for A.101, deps=[Z, D]
1: runs accept phase for A.101 via 4 and 5, will commit eventually

4: receives PreAccept(B.1, [A]), sends PreAcceptOK(B.102, [A]), but it's too late (B.1 already selected)
4: receives PreAccept(C.2, [B]), sends PreAcceptOK(C.103, [B]), updated because it has B.102 internally
1: receives above replies, have a quorum for C.103, deps=[B]
1: runs accept phase for C.103 via 3 and 4, will commit eventually

Here's a short reference of events over time on each replica:

1: ....ABCD
2: ZZZ  B
3: ZZZ  BCD
4: ZZZ    DABC
5: ........A

Now that replica 5 is out of a network partition, it receives commits in some particular order:

5: receives and executes all Z commands
5: receives A.101, deps=[Z, D], new least command, but needs to wait for D
5: receives D.3, deps=[A, C], new least command, but needs to wait for C
5: receives C.103, deps=[B], not a new least command

Running your algorithm we may decide that D.3 is the least command that needs to be executed, however that is not the case. There's actually a B.1 command in the cycle with the lowest seq that we haven't seen yet. Am I missing something about how seq and your algorithm are supposed to work?

@drmingdrmer
Author

In EPaxos, execution consistency only guarantees the order of interfering commands, not unrelated commands:

Execution consistency: If two interfering commands γ and δ are successfully committed (by any replicas) they will be executed in the same order by every replica.

That's why I said:

Assuming our discussion is only about interfering commands:

In your example D does not depend on B, thus the order of B and D is not deterministic.


I'm not sure how to parse this. Are you talking about direct dependencies, or transitive dependencies as well? I assume direct, since in SCC everything depends on everything, in such case here's a counter example:

Here what I mean is transitive dependency, so that a potential minimal-seq command is always reachable from an interfering command.

@snaury

snaury commented Mar 6, 2020

In your example D does not depend on B, thus the order of B and D is not deterministic.

But it does depend on B (indirectly), and replicas would diverge if this is not taken into account. For example, assume all keys have initial value of 0.

Case 1:

replica sees A, D, C first
executing Z.100: Z := 42
executing D.3:   D := C     // 0
executing A.101: A := Z + D // 42
new least command C.103 would have to wait for B
executing B.1:   B := A     // 42
executing C.103: C := B     // 0
replica has Z = 42, A = 42, B = 42, C = 0, D = 0

Case 2:

replica sees B
executing Z.100: Z := 42
executing B.1:   B := A     // 0
executing D.3:   D := C     // 0
executing A.101: A := Z + D // 42
executing C.103: C := B     // 42
replica has Z = 42, A = 42, B = 0, C = 42, D = 0

Here what I mean is transitive dependency, so that a potential minimal-seq command is always reachable from an interfering command.

Then I'm not sure how it helps with livelocks. Waiting for all transitive dependencies would mean replica needs to see a complete SCC, which may grow indefinitely, hence a livelock.

Anyway, thank you for your answers. I think using some kind of monotonically increasing per-replica propose time as a tie breaker (used for breaking ties only, thus clock drift would only affect latency of long chains of conflicting commands and their dependencies, so not much of an issue) is more promising for solving this problem.

@drmingdrmer
Author

Update of my previous comment:

As you had the setup with D->A but without D->B, I assumed we were talking about unoptimized deps. EPaxos §4.5 describes the optimization of deps:

Instead of including all interfering instances, we include only N dependencies in each list: the instance number R.i with the highest i for which the current replica has seen an interfering command (not necessarily committed).

  • For unoptimized deps, my algorithm should work, because any two interfering commands have a direct depends-on relation.

  • For optimized deps, D->A is unnecessary in the setup.

In order for this algorithm to work with optimized deps, a minor change needs to be added:

Add the next least (seq, replicaID) command and its deps into CMDs.

Here we need to retrieve all of the unoptimized dependent commands. Back to your steps:

5: receives D.3, deps=[A, C], new least command, but needs to wait for C
5: receives C.103, deps=[B], not a new least command

The optimized deps of D.3 are A, C and need to be expanded to A, B, C.
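
One possible reading of that expansion step, sketched in Go with illustrative names (nothing here is from the EPaxos code): an optimized dep records the highest interfering slot per replica, and expanding it means taking every slot of that replica up to and including the recorded one.

package deps

// InstanceID identifies an instance by owner replica and slot.
type InstanceID struct {
	Replica int
	Slot    int // per-replica instance number, assumed to start at 1
}

// expandDeps turns optimized deps (the highest interfering slot per replica,
// EPaxos §4.5) back into the full set of instances they stand for, whether or
// not each one interferes directly -- this is how B gets pulled in between A
// and C in the example above. The result is what the CMDs loop from the first
// comment would then consume.
func expandDeps(optimized map[int]int) []InstanceID {
	var full []InstanceID
	for replica, highest := range optimized {
		for slot := 1; slot <= highest; slot++ {
			full = append(full, InstanceID{Replica: replica, Slot: slot})
		}
	}
	return full
}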

@drmingdrmer
Author

Hi @snaury, what about my latest comment?

@snaury

snaury commented Mar 6, 2020

Hi @snaury, what about my latest comment?

I don't understand what you mean by any two interfering commands having a direct depends-on relationship. Commands B and D don't interfere, so don't have a relationship in my example. Do you propose expanding relationships during phase 1 or during execution? The former makes it harder to think of a counter example, but I'm not sure there isn't one (having more replicas allows for longer and more out-of-order chains, but it's harder to analyse), besides you would need to prove certain properties for the final committed seq, but I'm not sure what those are. The latter I think would mean you always expand your least seq relationships (each new discovered node brings more transitive dependencies) until you have the complete SCC.

@drmingdrmer
Author

I don't understand what you mean by any two interfering commands having a direct depends-on relationship.

For two committed interfering commands X and Y, there is at least one replica that has both of them.
Thus X depends on Y or Y depends on X or both.

Commands B and D don't interfere, so don't have a relationship in my example.

Yes.

Do you propose expanding relationships during phase 1 or during execution?

During execution.

The former makes it harder to think of a counter example, but I'm not sure there isn't one (having more replicas allows for longer and more out-of-order chains, but it's harder to analyse), besides you would need to prove certain properties for the final committed seq, but I'm not sure what those are.

I think EPaxos gives a complete proof of the consistency of the committed graph.

The latter I think would mean you always expand your least seq relationships (each new discovered node brings more transitive dependencies) until you have the complete SCC.

It does expand the least seq command, but this loop stops when the set of replicas of CMDs stops growing:

Repeat this step until we find a CMDs so that replicas(allDeps) ⊆ replicas(CMDs), then execute the least command in CMDs.

@snaury

snaury commented Mar 6, 2020

For two committed interfering commands X and Y, there is at least one replica that has both of them.
Thus X depends on Y or Y depends on X or both.

I'm not sure how that helps. Here's another example, we are constructing a graph that looks like this:

A <-> B <-> C <-> D <-> E <-> F
↑                             ⇵
U <-- V <-- W <-- X <-- Y <-- Z

There is a long chain of commands A, B, C, D, E, F conflicting pairwise. There is also a long chain of commands U, V, W, X, Y, Z where U conflicts with A and eventually Z conflicts with F. Here's an incomplete diagram of how proposes/pre-accepts and accepts are spread over time:

1: A u U   C       b B   A  ... E ...
2: B                 B a A  ... D ... F ...
3:       b     v V   B a A
4:   u U       v V      
5: U   U     V   V          ... W ... X ... Y ... Z

Here's a textual description of how these commands are processed and what happens to their seq and deps:

propose    A.0, B.0, U.0
pre-accept u.1 -> A
accept     U.1 -> A
pre-accept b.0
propose    C.0
propose    V.2 -> U
pre-accept v.2 -> U
accept     V.2 -> U
pre-accept b.1 -> A, C
accept     B.1 -> A, C
pre-accept a.2 -> B
accept     A.2 -> B
...

Here we assume that PreAccept messages are sent immediately after command is proposed, but are somehow delayed in transit and may arrive later in time.

Notice we have already accepted these commands:

  • A.2 -> B
  • B.1 -> A, C
  • U.1 -> A
  • V.2 -> U

Notice also that replica 2 has never seen (and will never see) any of U, V. Eventually C will be accepted with e.g. seq=3 and replica 2 will see this:

  • A.2 -> B
  • B.1 -> A, C
  • C.3 -> B, D

Replica 2 is not aware of U, and its owner 5 is not in its dependencies (the command that would eventually link A back to U is not even on anyone's radar yet, but will be in some distant future). Its least (seq, replicaID) command order would thus be B.1, A.2, C.3, ..., and maybe eventually U.1, V.2, ...

However on a replica that is aware of U (or waits for a complete SCC, which is prone to livelock) the order would be B.1, U.1, A.2, V.2, etc. Notice how directly conflicting commands A and U are executed in different order on different replicas. This is inconsistent and unsafe.

Yes, there is a replica that sees both A and U, but it doesn't help in my example. There may also be a replica that only sees one branch of the loop.

@snaury

snaury commented Mar 6, 2020

Here's an even better diagram where both replicas 1 and 2 are never aware that U exists:

1: A       C         b B   A  ......... E .....
2: B                   B a A  ... D ......... F
3:   a u U   b   v V   B   A  .................
4:     u U       v V          .................
5: U     U     V   V       W  ... X ... Y ... Z

      propose A.0, B.0, U.0
pre-accept-ok a.0
pre-accept-ok u.1 -> A
     accepted U.1 -> A
      propose C.0
pre-accept-ok b.1 -> A
      propose V.2 -> U
pre-accept-ok v.2 -> U
     accepted V.2 -> U
pre-accept-ok b.1 -> A, C
     accepted B.1 -> A, C
pre-accept-ok a.2 -> B
     accepted A.2 -> B

Only a single replica 3 observes both branches of the eventual loop.

@drmingdrmer
Author

You are right. My algorithm works only with a set of commands in which every two of them interfere with each other.

Thanks for the detailed diagram to explain 😆😆

Although it might be possible to generalize this algorithm, looking for the subgraph containing the least command cannot be done in a finite number of steps.

By the way, I did not use seq in my implementation either (as imoraru suggested :)).
I just use deps as a vector clock to sort all commands, with some additional restrictions. It seems easier to prove and to program. When I finish it, it would be nice to get your opinions. :)

@snaury

snaury commented Mar 12, 2020

I think I have a better solution to livelock. Here's a sketch of an idea:

  • If there are conflicting instances A and B, then A.deps and B.deps are treated as bidirectional at commit time, i.e. they show conflict edges, not dependencies
  • If there's an edge between instances A and B in the final graph, then A is executed before B if (A.seq, A.owner) < (B.seq, B.owner), and edge becomes directional
  • The final graph has no cycles, so no livelock problem exists
  • The invariant is that if A.seq <= B.seq then A will be in B.deps
  • The trick is to extend deps during the accept phase with (possibly redundant) conflicting instances

Proof sketch for conflicting nodes A and B:

  • Suppose A is committed with a fast commit, and B is not in A.deps. That means initial (A.seq, A.deps) was PreAccepted on a quorum of replicas. That means conflicting instance B will have B.seq >= A.seq + 1 and A will be in B.deps, as there will be a quorum intersection during phase 1 for B.
  • Suppose A is committed using paxos accept, and B is not in A.deps. That means during accept phase of some final (A.seq, A.deps) on a quorum of replicas B was not PreAccepted yet and was not yet known, and A will eventually commit without modifications. That means conflicting instance B will have B.seq >= A.seq + 1 and A will be in B.deps, as there will be a quorum intersection during phase 1 for B.
  • The same is true with A and B reversed. The key idea is that if B is not in A.deps then A.seq < B.seq, which means A will always be executed before B, and missing edge is irrelevant.
  • In all other cases A and B will have each other in deps if replicas add all currently conflicting instances to deps (without changing seq) before sending AcceptOK with those resulting instances. The leader will then use a union of all deps and commit the final result. We don't know what the final A.seq and B.seq will be, but we know both instances will have each other in their deps, since the cases where the phases don't intersect on any replica reduce to one of the first two cases above.
  • The tricky part is on recovery we may commit final deps that are slightly different from those the original leader (would have) chosen, since we may use a different quorum. Any of those deps are equally valid, in that there will be some redundant dependencies: when A.seq < B.seq we are guaranteed to have A in B.deps, so it's redundant and equally valid to have and not have B in A.deps, depending on a quorum we use for the decision.

The execution algorithm is then simple: for every instance A, wait for commits of all A.deps. Since we see all important edges between A and its dependencies, start executing A after executing all dependencies that have a lower (seq, replicaID) tuple.

The graph has no loops, so there is no livelock.
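
A minimal Go sketch of that execution rule, with invented names; it assumes every instance in a.Deps is already committed (the wait described above has happened) and treats each committed dep edge as a conflict edge directed by (seq, owner):

package seqexec

// Inst is a committed instance; fields are illustrative only.
type Inst struct {
	Owner    int     // proposing replica ID
	Seq      int     // final committed seq
	Deps     []*Inst // committed conflict edges (bidirectional at commit time)
	Executed bool
}

// before reports whether a must execute before b: lower (seq, owner) wins.
func before(a, b *Inst) bool {
	if a.Seq != b.Seq {
		return a.Seq < b.Seq
	}
	return a.Owner < b.Owner
}

// tryExecute applies a once every conflict edge that points "backwards"
// (to a lower (seq, owner) instance) has been executed; edges pointing
// forwards are ignored, which is what makes the directed graph acyclic.
func tryExecute(a *Inst, apply func(*Inst)) bool {
	if a.Executed {
		return true
	}
	for _, d := range a.Deps {
		if before(d, a) && !d.Executed {
			return false // a lower-ordered conflicting instance must run first
		}
	}
	apply(a)
	a.Executed = true
	return true
}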

What do you think?

@drmingdrmer
Author

drmingdrmer commented Mar 13, 2020

😲😲😲
Is the core idea about the final complete_deps like the following?

Compare the complete commit-time deps sets (complete_deps) of A and B,
e.g. walk through the deps graph of A and add all instances reached into A.complete_deps.

Finally A.complete_deps and B.complete_deps have 3 relations:

  • A.complete_deps ⊂ B.complete_deps: B after A.

  • A.complete_deps ⊃ B.complete_deps: A after B.

  • A.complete_deps ⊄ B.complete_deps and A.complete_deps ⊅ B.complete_deps:
    in this case, determine the order by (seq, owner).

If my understanding of your idea above is correct, I believe we have almost the same solution.
If so, then seq is not necessary.

Comparing complete_deps to determine an order is quite enough. E.g. find out the
max instance-id of n owners in a complete_deps: [max(R0), max(R1)...].
It is a vector clock about when an instance becomes stable, as I mentioned
in my previous comment.

@drmingdrmer drmingdrmer reopened this Mar 13, 2020
@snaury

snaury commented Mar 13, 2020

I'm not sure what you mean about comparing complete_deps like that. In my idea seq is necessary because it is stable and determines the final order. Determining order using set intersections doesn't seem safe, because then it could be sensitive about complete_deps being in flux (it's unstable and may change during recovery, different replicas may see different complete_deps for the same instance, is it possible to show execution order doesn't change in that case?).

What I show is that if A executes before B and B is not in A.complete_deps, then it's ok for replica to not be aware of B and it's still safe for this replica to execute A before it even becomes aware of B. Other replica may recover a slightly different A that has B in A.complete_deps, but it doesn't change the execution order, so it's safe.

@imoraru
Member

imoraru commented Mar 13, 2020

The fact that you don't have a cycle in the complete dependency graph doesn't seem sufficient to avoid livelock. You also need the ability to start executing before the entire graph is fully committed.

Consider the following sequence of PreAccepts, with non-transitive interference relations C1 ~ C2 ~ C3 ~ ... ~ Cn ~ C0

Time | Replica1     | Replica2       | Replica3
1    | C0 seq = 0   | C0 seq = 0     |
2    | C1 seq = 0   | C2 seq = 0     |
3    | C3 seq = 1   | C1 seq = 1     |
4    | C2 seq = 2   | C4 seq = 2     |
5    | C5 seq = 3   | C3 seq = 3     |
6    | C4 seq = 4   | C6 seq = 4     |
7    | C7 seq = 5   | C5 seq = 5     |
8    | C6 seq = 6   | C8 seq = 6     |
...  | ...          | ...            |
n    | Cn seq = n-2 | Cn-2 seq = n-2 |
...  | ...          | ...            |

Assume that C0 commits only after this entire sequence of events. It's clear that we can construct examples like this with arbitrarily long chains of dependencies. After we commit C1 and try to execute it, we must follow the deps graph edges until we eventually get to C0, which has no deps. This is livelock, because n can be arbitrarily high. At no point can we execute C1 without having finalized the entire deps graph, because we never know that C1 has the smallest seq of any command we might discover by following the deps edges to the end.

@snaury

snaury commented Mar 13, 2020

This is livelock, because n can be arbitrarily high.

I don't think it can be called a livelock, because each instance reaches commit state independently and takes a finite number of steps. In my scheme as soon as C0 is committed it may or may not have a set of interfering nodes that would have to be observed, but otherwise its execution cannot be delayed indefinitely. The same is with C1, when it eventually commits (it should take a finite time, as it only depends on communication with a subset of other nodes and available cpu time for processing replies) it would have accumulated some other instances that would have to be observed before it could be executed, but that set of potential dependencies is again finite. For each committed instance you only need to observe a set of its directly interfering instances (that may or may not end up as dependencies) frozen at commit time. Sequence number of interfering command chain increases, except between some commands that overlap in time, so you would eventually reach some already executed stable state.

This is different from original epaxos, where SCC can grow indefinitely and can never be complete, so it would never be executed. I may be mistaken, but I assumed that was the livelock problem mentioned in the paper.

When committed instances have other instances in their interference set and those instances are not committed for a long time, recovery should kick in and either commit the original command or replace it with a no-op. It's probably possible for several replicas to fight over who finishes recovery, but that's a separate issue for paxos in general.

@snaury

snaury commented Mar 13, 2020

You also need the ability to start executing before the entire graph is fully committed.

Also, that's exactly what I'm doing. Let's assume C1 is committed with seq = 1 and deps = C0, C2, C3, C4 (e.g. at time 5). It would only have to wait for its deps to commit, at which point we would drop C2, C3 and C4 from its deps (they would have a higher seq). The final result is that C1 only has to wait for C0, which would not be dropped (lower seq) and will be executed first, then C1. Even though n can be very large (all those instances would form an SCC in original epaxos) in my proposal C1 is executed as soon as 4 other instances are committed and C0 is executed (which would have no deps, assuming it uses fast commit).

@snaury

snaury commented Mar 13, 2020

Moreover, if C0 and C1 in your example don't interfere (at first I assumed they do), then C1 would not have C0 in its deps and would not have to wait for it before executing, and C1 will only end up with C2 in its deps. As soon as C2 is committed it would be obvious that C1 should be executed first (both have each other in deps, but only one direction is a dependency).

Yes, C0 has lowest seq, but if C0 and C1 don't interfere they can (and would) be executed in arbitrary order. In my proposal seq only determines order between a pair of instances, not an entire graph. Only when we reach Cn would ordering between C0 and Cn start to matter.

Note that in original epaxos dependencies in the final graph may end up looking like this:

C1 <-> C2 <-> C3 <-> C4 <-> ... <-> Cn <-> C0

This is an arbitrarily large SCC and we have to wait until it is complete. However in my proposal graph would end up looking like this:

C1 <- C2 <- C3 <- C4 <- ... <- Cn -> C0

There are no ambiguities in the graph, everything is linear (since your example is very tidy), it's obvious that C1 and C0 must be executed first, but it may happen in any order. If seq are more randomized, then some arrows would be flipped accordingly, but you won't have to wait for the entire graph before deciding it can be executed, you only wait for a small subset of overlapping commands that you are not yet sure about. It doesn't matter what exact order is chosen between each pair of overlapping commands, it just needs to be consistent on each replica.

@imoraru
Member

imoraru commented Mar 13, 2020

C1 and C0 must be executed first, but it may happen in any order

Ah, yes, that's true.

Why do you need to update deps in the Accept phase of a command? Can you give an example where that's necessary?

@snaury

snaury commented Mar 13, 2020

Why do you need to update deps in the Accept phase of a command? Can you give an example where that's necessary?

Here's a diagram:

 t | 1 2 3 4 5 |
---+-----------+----------------------------------------
 1 |     C C C | C is committed with C.seq=100 (via some out of scope dependencies)
 2 | A       B | A and B are proposed, there are no initial conflicts, so A.seq = 0, B.seq = 0
 3 |   a a     | A is pre-accepted on replicas 2 and 3, replica 3 updates A.seq=101, A.deps=[C]
 4 |   b   b   | B is pre-accepted on replicas 2 and 4, replica 2 updates B.seq=1, B.deps=[A]
 5 | A A A     | A is accepted via replicas 1, 2, 3, it sees pre-accepted B, replica 1 commits A.seq=101, A.deps=[C, B]
 6 |   B   B B | B is accepted via replicas 2, 4, 5, no additional updates, replica 5 commits B.seq=1, B.deps=[A]

The final graph is C <- A -> B, i.e. A must be executed after both B and C (which don't interfere with each other). Note that if we didn't update A.deps in the accept phase, then replica 1 would not be aware it needs to wait for B (which ends up with a lower seq due to a particular choice of participating replicas) before executing A (even though B must be executed before A). Because we update A.deps with (possibly redundant) interfering commands, all executing replicas would be forced to wait for B to commit, at which point they would see that B must be executed before A.

Also note that if accept phase of A did not intersect with B (e.g. because pre-accept of B on replica 2 happens after accept of A), then that would mean B did not finish pre-accept yet, and would necessarily pick B.seq = 102, with final graph changed to C <- A <- B. In that case it's ok for A to be committed without B in A.deps.

@imoraru
Member

imoraru commented Mar 13, 2020

Oh, I suppose it's because you want to ensure that you see all the commands that interfere with your command, and which might still get a lower seq than your final seq. The fact that the Accept has the final seq for your command guarantees that all other commands (that you don't see) will get higher seqs.

Yeah, I think it works. If I were to describe this I wouldn't make the Accept phase add anything to the final committed deps. Instead I would say the extra deps are just an execution hint for the command leader. This way you don't have to complicate recovery.

Finally, I want to point out, as I did in the paper, that this seq-based algorithm doesn't ensure strict serializability when command interference is non-transitive. For example, say that a user writes two objects, A and B (initially A = 0, B = 0). First the user commits A = 1, then some time later, the user commits B = 1. While all this is happening, someone else is trying to read both A and B. Consider the following PreAccept order:

Time | Replica 1         | Replica 2         | Replica 3
1    |                   | C0, seq = 0       |
2    |                   | Read A&B, seq = 1 |
3    |                   | Write A, seq = 2  |
4    |                   |                   | Write A, seq = 2
5    |                   |                   |
6    | Write B, seq = 0  |                   |
7    |                   |                   | Write B, seq = 0
8    | Read A&B, seq = 1 |                   |

Read A&B is ordered before Write A, but after Write B. As a consequence it returns A = 0 & B = 1, which is not consistent with the time ordering of the operations.

@snaury

snaury commented Mar 13, 2020

Finally, I want to point out, as I did in the paper, that this seq-based algorithm doesn't ensure strict serializability when command interference is non-transitive.

Yes, thank you, I'm aware. :) Our database is currently serializable, but not strictly serializable (by default), due to some optimizations that cause similar "time travel" issues. But then, just like the paper states, it's possible to wait for the dependency chain to be committed before acknowledging the commit to the client; that would make sure any newly interfering command acquires a higher seq than what we depended on.

@drmingdrmer
Author

I'm not sure what you mean about comparing complete_deps like that. In my idea seq is necessary because it is stable and determines the final order. Determining order using set intersections doesn't seem safe, because then it could be sensitive about complete_deps being in flux (it's unstable and may change during recovery, different replicas may see different complete_deps for the same instance, is it possible to show execution order doesn't change in that case?).

A.complete_deps is the set of instances A can walk to, before A is committed.
It only includes instances A has seen during pre-accept phase. E.g. when handling pre-accept,

  • on R1 there is a finite depends-on graph D1 rooted from A,
  • on R2 there is a finite depends-on graph D2 rooted from A,
    ...

A.complete_deps is the union of all nodes in D1, D2....

deps is the finite set of all instances interfering with A.
complete_deps is a sub-graph of the final SCC that contains A:
deps ⊆ complete_deps ⊆ SCC.
complete_deps does not change after commit. Sorry for the bad name; it is not actually complete. ˙––˙

Determine the order:

If A did not see B during pre-accept (A is in B.complete_deps but B is not in A.complete_deps), execute A before B.

Let max(A, R[i]) be the max id of the instances that are in A.complete_deps and belong to owner R[i].

Sort A and B by sum(max(A, R[0]), max(A, R[1]), ...) and sum(max(B, R[0]), max(B, R[1]), ...).

This guarantees execution consistency and does not need seq.
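
A tiny Go sketch of that rank, with hypothetical names (instance ids are assumed to start at 1, so 0 means "no instance of that owner in the set"):

package rank

// Dep identifies an instance inside some owner replica's instance space.
type Dep struct {
	Owner int
	ID    int // per-replica instance id, assumed to start at 1
}

// rank computes sum(max(A, R[0]), max(A, R[1]), ...): for every owner replica
// take the highest instance id present in A.complete_deps (0 if that owner is
// absent) and sum the results. Two instances are then ordered by comparing
// their ranks, unless the "A did not see B" rule above already fixes the order.
func rank(completeDeps []Dep) int {
	maxPerOwner := map[int]int{}
	for _, d := range completeDeps {
		if d.ID > maxPerOwner[d.Owner] {
			maxPerOwner[d.Owner] = d.ID
		}
	}
	sum := 0
	for _, m := range maxPerOwner {
		sum += m
	}
	return sum
}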

@snaury

snaury commented Mar 14, 2020

If I were to describe this I wouldn't make the Accept phase add anything to the final committed deps. Instead I would say the extra deps are just an execution hint for the command leader. This way you don't have to complicate recovery.

Initially I thought replicas would persist new deps to later reply with the earliest possible result. But now I think you're right: it could just be a hint, as long as a replica that receives Accept(A) only includes a conflicting instance B in the hint if B.seq <= A.seq (sketched after the list below):

  • If replica had accepted B in the past and B.seq > A.seq, then B.seq will not change and B will be executed after A, so no reason to include B in the hint.
  • If replica had pre-accepted B in the past and B.seq > A.seq, then if it's part of pre-accept quorum the final B.seq will not be smaller, but leader may choose a different pre-accept quorum with possibly lower B.seq. However, since Accept(A) must also intersect with that different quorum, B will be included in the hint by a different replica if it happens to have B.seq <= A.seq, thus no reason to include B in the hint from this replica.
  • If we choose the initial seq not just based on direct conflicts, but on an accumulated max_seq that we increase while communicating with other replicas, then the likelihood of newly proposed instances being spuriously included in the hint during recovery diminishes over time, so there's no need to persist it until we commit.
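
A small sketch of that hint rule, with invented names: when handling Accept(a), report only the locally known conflicting instances whose seq is not larger than a's.

package hints

// Inst is a locally known instance; the type is illustrative only.
type Inst struct {
	Owner int
	Seq   int
}

// acceptHints filters the conflicting instances this replica knows about:
// anything with seq > a.Seq is guaranteed to be ordered after a anyway, so
// only instances with seq <= a.Seq are worth returning as execution hints.
func acceptHints(a *Inst, conflicting []*Inst) []*Inst {
	var hints []*Inst
	for _, b := range conflicting {
		if b.Seq <= a.Seq {
			hints = append(hints, b)
		}
	}
	return hints
}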

Btw, I realized that ordering by (seq, replicaID) may not be the best idea, as multiple conflicting instances from the same replica may end up with the same final seq. I think if each replica chooses initial-seq based on communication with other nodes (like hybrid logical clock), and each instance proposed by the same replica has distinct initial-seq values, then it may be better to use (seq, initial-seq, replicaID) for ordering conflicting nodes, giving earlier proposals priority, not replicas that happened to have a lower replicaID. It may behave nicer when conflict rate is closer to 100%.

Anyway, I'm not sure why people hate seq so much. I love seq, it gives very cool properties to the dependency graph. :)

@drmingdrmer
Author

drmingdrmer commented Mar 14, 2020

Anyway, I'm not sure why people hate seq so much. I love seq, it gives very cool properties to the dependency graph. :)

To me, seq is an extension to determine the order. The dependency graph alone could contain enough information for ordering. seq adds more rules to the system in order to make it work correctly. The fewer rules there are, the easier it is for people to understand the system and write correct code.

@snaury

snaury commented Mar 14, 2020

If A did not see B during pre-accept (A is in B.complete_deps but B is not in A.complete_deps), execute A before B.

You need to be a little more formal about your statements. Here's an example:

| 1 2 3 |
+-------+
| A B C |
| c a b |

At time 1 three instances are proposed, one on each replica. All of them send pre-accept to the next replica and eventually commit the result. What you get is a perfect loop, where A depends on B depends on C depends on A... and you need to somehow disambiguate the order of such SCCs, getting you into a livelock bind. Using 5 replicas it's possible to engineer chain-linked perfect loops with 4 instances (what you need to do is engineer a sequence of propose and pre-accept events over time for 4 instances, where 1 instance does not exist at the first time slot and 1 instance is already complete at the last time slot, then you overlap several loops like that in time, sharing one instance). That's even harder to deal with, especially if you engineer a parallel backwards loop like I showed earlier with 2-instance loops.

If you start accumulating additional dependencies during the accept phase like I propose, you must be aware that different replicas (leader and recovery) may use different quorums (we must consider the case where each replica may only be able to communicate with a subset of other replicas, you may not always see the full picture), accumulate different dependencies, and you may have different dependency graphs on different replicas. In my case that's perfectly fine, since I show how using seq for determining direction causes it to never change. Using other arbitrary criteria may cause the order to change.

@drmingdrmer
Author

If A did not see B during pre-accept (A is in B.complete_deps but B is not in A.complete_deps), execute A before B.

You need to be a little more formal about your statements. Here's an example:

| 1 2 3 |
+-------+
| A B C |
| c a b |

Sorry, it's my bad.

  • If A is initiated after B becomes safe: A should execute after B.
  • If A is initiated before B becomes safe, any consistent order for A and B is reasonable.

Safe means:

  • B has collected enough pre-accept replies and can commit on the fast path,
  • or B has collected enough accept replies and can commit on the slow path.

B being safe implies recovery will never choose a different value of B.

If "A initiated after B becomes safe" is satisfied, then there is a quorum containing identical safe values of B.
A will see the safe value of B at least once.
B.complete_deps will be added into A.complete_deps.
B.complete_deps ⊆ A.complete_deps.
sum(max(A, R[i])...) >= sum(max(B, R[i])...).
∴ It satisfies the execution linearizability guarantee: if two interfering commands γ and δ are serialized by clients (i.e., δ is proposed only after γ is committed by any replica), then every replica will execute γ before δ.

If you start accumulating additional dependencies during the accept phase like I propose, you must be aware that different replicas (leader and recovery) may use different quorums (we must consider the case where each replica may only be able to communicate with a subset of other replicas, you may not always see the full picture), accumulate different dependencies, and you may have different dependency graphs on different replicas. In my case that's perfectly fine, since I show how using seq for determining direction causes it to never change. Using other arbitrary criteria may cause the order to change.

I do not know why this is a problem.
The leader and recovery process should always commit the same value if both of them committed.
And only one of them could have committed if they chose different values.
And finally, every replica has the same graph. Thus the execution algorithm has the same view on every replica. Thus it should not be a problem. What have I missed?

@snaury

snaury commented Mar 14, 2020

The leader and recovery process should always commit the same value if both of them committed.

That does not magically happen. The reason the committed value will be the same on recovery is that when there is potential for disagreement (there is no fast quorum of pre-accept-eq replies) epaxos goes through a separate accept phase. Only when the accept phase acquires a quorum can you be sure that the exact value you accepted will be discovered during recovery. Conversely, if recovery is in progress and there's a risk of choosing a different value, the leader would not be able to finish accept for its value, since the quorum will be blocked by the prepare phase of a replica which started recovery.

What I'm doing is I acquire additional hints during the accept phase, but I don't have a separate accept phase for those hints (only the command itself, seq and original deps are fixed during the accept phase). That means all those additional hint dependencies are unstable, they depend on the exact quorum which is used for recovery, they are in flux and they may change on recovery. What's more, there may be a race between a commit broadcast from the original leader, and a slightly different commit broadcast from another replica which started recovery (they would have the same seq and command, but not accumulated dependencies). In my case it doesn't compromise safety, because some of those hint dependencies are redundant. Only those hints that are present in all possible recovery quorums are actually necessary, which means if a replica receives multiple commit messages for the same instance with different hint dependencies it can update its graph by using an intersection.

I'm not sure how your solution can deal with different commit messages that have different complete_deps in them. I suppose you could try layering more and more accept phases on top of each other, until you reach some union that agrees completely, but that would hurt latency and make recovery very complicated.

@drmingdrmer
Author

The leader and recovery process should always commit the same value if both of them committed.

That does not magically happen. The reason the committed value will be the same on recovery is that when there is potential for disagreement (there is no fast quorum of pre-accept-eq replies) epaxos goes through a separate accept phase. Only when the accept phase acquires a quorum can you be sure that the exact value you accepted will be discovered during recovery. Conversely, if recovery is in progress and there's a risk of choosing a different value, the leader would not be able to finish accept for its value, since the quorum will be blocked by the prepare phase of a replica which started recovery.

I do not understand. In my opinion epaxos recovery guarantees that:

  • If the leader believes committing on the fast path is safe, recovery should always commit the same value the leader chose (e.g. there are several restrictions for commit-on-fast-path that must be met). Do you mean that if the leader did not start the accept phase, but committed directly on the fast path, recovery would not choose the same value the leader had chosen?
  • Otherwise the leader goes on the slow path, and when the leader receives enough positive replies, the recovery process should also commit the same value the leader chose in the accept phase.

And yes, original epaxos does not guarantee the hint would be committed with the same value by leader and recovery.

Thus you have the intersection solution.
My thought is to get rid of deps and instead commit complete_deps (you call it accumulated deps; I think that is more precise :)).

In every phase, use complete_deps as a replacement for deps; thus the EPaxos guarantees ensure that the value of complete_deps is consistently chosen by the leader or by recovery.

I'm not sure how your solution can deal with different commit messages that have different complete_deps in them. I suppose you could try layering more and more accept phases on top of each other, until you reach some union that agrees completely, but that would hurt latency and make recovery very complicated.

No more phases, just pre-accept and accept, then commit.

@snaury

snaury commented Mar 14, 2020

In every phase, use complete_deps as a replacement for deps; thus the EPaxos guarantees ensure that the value of complete_deps is consistently chosen by the leader or by recovery.

Can you show some examples of how you would decide on complete_deps? The way I understand it is you are using my idea of attaching extra dependencies in Accept replies, but I may just be assuming because you proposed it after my comments on it. Here's an example without fast commit (adding extra commands that force an accept phase would just clutter the picture) assuming I understood correctly:

t | 1 2 3 4 5 |
--+-----------+----
1 | A a a     | replica 1 proposes A, sends PreAccept to 2, 3
  |         B | replica 2 proposes B, PreAccept messages are lost/stuck in queues
  | A A     A | replica 1 receives PreAcceptOK replies, sends Accept(A, deps=[]) to 2, 5
  | A         | replica 1 receives AcceptOK replies from 2 and 5, saves Commit(A, complete_deps=[B]) locally and dies
  |   A A A   | replica 2 decides to recover A, successfully prepares on replicas 2, 3, 4
  |   A A A   | replica 2 does recovery using Accept(A, deps=[]) on replicas 2, 3, 4
  |   A       | replica 2 saves Commit(A, complete_deps=[]) and broadcasts

Now what is the right commit value? The original leader thinks A depends on B (because it received a hint about B during the accept phase). Replica 2 that did recovery thinks A does not depend on B and may have already executed it. As you can see different replicas have different graphs, so I must not understand what you propose.

How do you accumulate dependencies so they are committed consistently? How do you solve chain linked loops which grow indefinitely? How do you decide you are not missing some instances that will loop back on you later?

@drmingdrmer
Author

In every phase, use complete_deps as a replacement for deps; thus the EPaxos
guarantees ensure that the value of complete_deps is consistently chosen by the
leader or by recovery.

Can you show some examples of how you would decide on complete_deps?

complete_deps is decided the same way as deps, in the pre-accept phase only.
The content of complete_deps is almost the same as your accumulated deps,
but there is no need to update it in the accept phase.

Running with complete_deps is the same as running with deps (a sketch follows the list):

  • Leader sends pre-accept with initial complete_deps.

  • A replica receives the pre-accept request and updates complete_deps by taking the union
    of it, the interfering instances on this replica, and also other indirectly
    dependent instances. Finally, it responds with the updated complete_deps.

  • The leader receives the pre-accept replies.
    If the condition to commit on the fast path is satisfied, commit.
    The conditions are the same as EPaxos specifies.
    The condition could be one of:

    • all responded complete_deps are the same as the initial
      complete_deps,
    • or all updated complete_deps are the same and all new deps
      are committed;
    • ...
  • Otherwise, the leader sends an accept request with the union of all responded
    complete_deps.

    Because enough hint info has been collected in the pre-accept phase, we do not
    need to do anything else in the accept phase.

  • If ⌊n/2⌋ + 1 positive replies received, commit.
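
Here is the sketch referred to above, for the pre-accept step only (types and helpers are made up): the replica unions the proposed complete_deps with the interfering instances it knows locally, plus whatever those instances have already accumulated.

package preaccept

// DepSet holds instance ids such as "R1.7"; purely illustrative.
type DepSet map[string]bool

// handlePreAccept merges the leader's proposed complete_deps with the local
// view: every interfering instance known on this replica, together with the
// (already accumulated, hence transitive) deps this replica has stored for it.
// localDepsOf is a hypothetical lookup into the replica's local state.
func handlePreAccept(proposed DepSet, localInterfering []string,
	localDepsOf func(string) DepSet) DepSet {

	updated := DepSet{}
	for id := range proposed {
		updated[id] = true
	}
	for _, id := range localInterfering {
		updated[id] = true
		for dep := range localDepsOf(id) {
			updated[dep] = true
		}
	}
	return updated
}

The reply carries updated back to the leader, which unions the replies and either fast-commits or runs the accept phase with that union, as in the list above.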

The triangle cycle example:

t | 1 2 3 |
----------+
1 | A B C | initiate A B C
2 | c a b | pre-accept.
3 |       | Receive pre-accept-reply, A.complete_deps=[B], B.complete_deps=[C], C.complete_deps=[A]
4 | C A B | Cannot commit on fast-path; send Accept requests
5 |       | A, B, C commit with A.complete_deps=[B], B.complete_deps=[C], C.complete_deps=[A]

Finally, no complete_deps is a superset of any other. The order of A, B, C is
decided by their rank: rank(X) = sum(max(X, R[0]), max(X, R[1]), ...),
where max(X, R[i]) is the max instance id of owner R[i] in the set {X} ∪ X.complete_deps.
Here the definition is modified: add X itself into the set so that there won't be rank(A) == rank(B).

rank(A) = sum(max(A, R0), max(A, R1), max(A, R2)) = sum(A, B, 0) = A+B
rank(B) = sum(max(B, R0), max(B, R1), max(B, R2)) = sum(0, B, C) = B+C
rank(C) = sum(max(C, R0), max(C, R1), max(C, R2)) = sum(A, 0, C) = C+A

The way I understand it is you are using my idea of attaching extra
dependencies in Accept replies, but I may just be assuming because you
proposed it after my comments on it.

I have been using the vector-clock approach to solve the ordering problem in my project.
But I found our ideas are quite similar: to get a just-big-enough subgraph of the
SCC before committing. Thus I was trying to describe it with your terminology.

Here's an example without fast commit (adding extra commands
that force an accept phase would just clutter the picture) assuming I understood
correctly:

t | 1 2 3 4 5 |
--+-----------+----
1 | A a a     | replica 1 proposes A, sends PreAccept to 2, 3
2 |         B | replica 2 proposes B, PreAccept messages are lost/stuck in queues
3 | A A     A | replica 1 receives PreAcceptOK replies, sends Accept(A, deps=[]) to 2, 5
4 | A         | replica 1 receives AcceptOK replies from 2 and 5, saves Commit(A, complete_deps=[B]) locally and dies
5 |   A A A   | replica 2 decides to recover A, successfully prepares on replicas 2, 3, 4
6 |   A A A   | replica 2 does recovery using Accept(A, deps=[]) on replicas 2, 3, 4
7 |   A       | replica 2 saves Commit(A, complete_deps=[]) and broadcasts

First, at t=3, A has received 2 pre-accept replies, so R1 could commit on the fast path.
I do not know why there would be an accept phase for A:
at t=3 R1 should commit with A.complete_deps=[].

If R1 insists on running an accept phase, it should be correct too.

According to EPaxos, there is no update to A when R1 receives the
accept replies.
At t=4 R1 believes A.complete_deps=[] is safe to commit.
B should not be added to A.complete_deps.

At t=5, if the recovery process sees an accept-phase value of A, it should respect the one with the
highest ballot, thus the recovery process still chooses A.complete_deps=[].
This is much the same as classic Paxos specifies.

Now what is the right commit value?

A.complete_deps=[]

The original leader thinks A depends on B (because it received a hint
about B during the accept phase).

The leader does not need to update A after receiving accept-reply.

Replica 2 that did recovery thinks A does not depend on B and may have
already executed it.

Yes, leader and recovery should have chosen the same value.

As you can see different replicas have different graphs, so I must not
understand what you propose.

How do you accumulate dependencies so they are committed consistently?

Accumulated dependencies (or complete_deps) are collected the same way as deps
are. There is no update in or after the accept phase, thus there is no inconsistency.

How do you solve chain linked loops which grow indefinitely?

complete_deps is finite and is determined when A commits; it does not grow as
other linked instances commit.

How do you decide you are not missing some instances that will loop back on you later?

As I previously mentioned (you did not reject it, thus I assume it is true :D):

If the entire SCC is committed, executing instances by their rank rank(X)
respects the execution linearizability guarantee, because we have:
A initiated after B becomes safe implies rank(A) > rank(B).

Thus to execute A, we do not need to care about instances with higher rank:

  1. Wait for every instance in A.complete_deps to be committed.

  2. Sort the set {A} ∪ A.complete_deps by rank(X).
    If A has the least rank, execute A.
    Otherwise, execute the least-rank instance first, applying these steps to it from step 1.

    There won't be an infinite loop, because rank(X) keeps going down.
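
A minimal sketch of those two steps, assuming every instance reached already has its complete_deps committed (step 1) and its rank precomputed with the rank() from the earlier comment; all names are invented.

package rankexec

// Inst is a committed instance with a precomputed rank.
type Inst struct {
	Owner        int
	Rank         int
	Executed     bool
	CompleteDeps []*Inst
}

// nextToExecute walks from a toward lower ranks: if some un-executed member of
// a.CompleteDeps has a smaller rank (owner id breaks ties), descend into it and
// repeat; otherwise a itself is the next instance to execute. The walk
// terminates because the rank never increases along it.
func nextToExecute(a *Inst) *Inst {
	for {
		least := a
		for _, d := range a.CompleteDeps {
			if !d.Executed && (d.Rank < least.Rank ||
				(d.Rank == least.Rank && d.Owner < least.Owner)) {
				least = d
			}
		}
		if least == a {
			return a
		}
		a = least
	}
}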

@snaury

snaury commented Mar 15, 2020

complete_deps is decided the same way as deps, in the pre-accept phase only.

Wouldn't it be just normal deps from original epaxos then, what's the difference?

The content of complete_deps is almost the same as your accumulated deps,
but there is no need to update it in the accept phase.

But if it's decided in the pre-accept phase, then it's not even close to being the same: my accumulated deps are accumulated during the accept phase, so how can it be the same if yours happens in the pre-accept phase?

Here the definition is modified: add X itself into the set so that there won't be rank(A) == rank(B).

I'll take your word for it, but I don't understand how you could ever guarantee that, especially in very complex loops. You even show a table with values for rank(A), rank(B) and rank(C), but A, B and C could have the same instance id (it's assigned at propose time and local to each replica), so wouldn't they be all equal to each other?

First, at t=3, A has received 2 pre-accept replies, so R1 could commit on the fast path.

I omitted the fast path, because it's irrelevant to the issue (it's basically a cheat where you got lucky and something happened the same way on all replicas; being lucky is good for performance, but the algorithm must be safe when the fast path doesn't exist). Working around the fast path is always possible, but it makes the graph messy; it's easier to consider two instances instead of 3, 4, 5, etc.

If the entire SCC is committed

The problem I'm trying to solve is how to start executing the graph when entire SCC is not committed, and may never be committed, because it never stops growing. Think what happens when conflict rate is close to 100%, commands proposed on multiple replicas will almost always be looped to each other with more and more dependencies that are not committed yet.

@snaury

snaury commented Mar 15, 2020

@drmingdrmer please look at the following diagram:

r | time ->     |
--+-------------+
1 | A B C D     |
2 | a b c d     |
3 |   b c d E   |
4 |         e a |
5 |         e   |

Here conflicts are pairwise, i.e. A <-> B <-> C <-> D <-> E and also A <-> E. After this sequence of events (where A is a propose and a is a pre-accept), the graph will eventually look like this:

A <- B <- C <- D <- E
|                   ↑
+-->---->--->--->---+

Now let's say replica 5 receives some (but not all) commit messages, what would be the first instance it executes? If you answer B (seems to have the lowest rank, since it only depends on A, which would get inflated rank due to E), let's say E no longer conflicts with D (same graph, but no edge between D and E), is the answer still B? Would replica 1 (if it did not receive commit for E) execute B first? (after all rank(B) = B < rank(A) = A+E, if I understand correctly) What would replica 5 execute first?

@snaury

snaury commented Mar 15, 2020

Ah, I think I'm starting to understand. What you propose is to accumulate transitive conflicts of each conflicting instance you encounter during pre-accept, this way you can glimpse what could have been chosen during a commit. After conflicting instances A and B are committed, if A conflicts are a subset of what pre-accept of B seen for A, then B has a non-reversible dependency on A. Otherwise there is potential for overlap, making dependency effectively bidirectional, with direction decided based on rank.

That's an interesting idea, but I'm not sure it works in some edge cases, where dependencies don't match exactly because of different quorums.

| 1 2 3 |
+-------+
|   C D | C and D are proposed
| A     | A is proposed, initially A.complete_deps=[]
| c     | PreAccept(C) reaches replica 1, reply PreAcceptOK(C, complete_deps=[A])
|   a   | PreAccept(A) reaches replica 2, reply PreAcceptOK(A, complete_deps=[C])
|   C   | PreAcceptOK(C, complete_deps=[A]) received, broadcast Accept(C, complete_deps=[A])
| C C   | Accept and Commit(C, complete_deps=[A])
|     a | PreAccept(A) reaches replica 3, reply PreAcceptOK(A, complete_deps=[D])
|     B | B is proposed, initially B.complete_deps=[A, D]
| b     | PreAccept(B) reaches replica 1, reply PreAcceptEQ(B, complete_deps=[A, D])
|     B | Fast Commit(B, complete_deps=[A, D])
| A     | PreAcceptOK(A, complete_deps=[C]) received, broadcast Accept(A, complete_deps=[C])
|     A | Accept(A, complete_deps=[C]) received, reply AcceptOK(A, complete_deps=[C])
| A A   | AcceptOK(A, complete_deps=[C]) received, Commit(A, complete_deps=[C])
|   D D | Commit(D, complete_deps=[A, B, C])

Final graph will have the following dependencies and ranks:

A -> C
C -> A
B -> A, D
D -> A, B, C

A <-> C
^  \  ^
|   \ |
B <-> D

Notice how there are two loops A <-> C and B <-> D, with dependency from one loop to another. When replica 2 receives commits A and C they will be executed (in some order), because it cannot see B and D yet. As far as I can understand your rules B is not ordered after A because C is not a subset of A, D, so the order is determined by rank. When replica 3 receives all commits it may choose different order, e.g. let's say instance-ids are:

  • A = 1000
  • C = 1000
  • D = 100
  • B = 101

Then we have:

  • rank(A) = A+C = 2000
  • rank(C) = C+A = 2000
  • rank(B) = A+B = 1101
  • rank(D) = A+C+D = 2100

We would see B having the lowest rank, what's stopping us from executing it before A and C? (even though dependency-wise B clearly goes after A and C)

Unfortunately contradictions are still possible. Looking at subset/superset is a nice idea, but it cannot guarantee edges will be present on both ends of potentially intersecting commands.

@drmingdrmer
Author

r | time ->     |
--+-------------+
1 | A B C D     |
2 | a b c d     |
3 |   b c d E   |
4 |         e a |
5 |         e   |

Now let's say replica 5 receives some (but not all) commit messages, what would be the first instance it executes? If you answer B (seems to have the lowest rank, since it only depends on A, which would get inflated rank due to E),

With this time flow, the final ranks would be:

A: A+E
B: B
C: C
D: D
E: D+E

And it is a loop, thus my answer is B.

let's say E no longer conflicts with D (same graph, but no edge between D and E), is the answer still B? Would replica 1 (if it did not receive commit for E) execute B first? (after all rank(B) = B < rank(A) = A+E, if I understand correctly) What would replica 5 execute first?

A: A+E
B: B
C: C
D: D
E: 

B or E could be executed first, on different replicas:

  • R1 would execute B first if R1 has not received commit-E yet.
  • R5 would execute E first if R5 has not received commit-B yet.

Since D and E do not interfere, it does not matter. The possible orders are E BCD A or BCD E A.

PS. I'm gonna answer your question in the second-to-last comment right away. :DDD

@drmingdrmer
Author

complete_deps is decided the same as deps, in pre-accept phase only.

Wouldn't it be just normal deps from original epaxos then, what's the difference?

The difference is that complete_deps includes all indirectly dependent instances.
E.g. with A -> B and B -> C, C is in A.complete_deps, not just B.
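
Roughly, a replica would collect it at pre-accept time like below (just a sketch with made-up names: instances is the replica's local view of known instances, interfering_with is a hypothetical conflict check; this is not the actual epaxos code):

```python
def complete_deps_at_preaccept(cmd, instances, interfering_with):
    """Sketch: direct conflicts plus their already-known transitive
    dependencies, as recorded locally on this replica."""
    deps = set()
    for y in instances.values():
        if interfering_with(cmd, y.cmd):
            deps.add(y.id)                 # direct dependency
            deps.update(y.complete_deps)   # indirect deps seen so far
    return deps
```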

The content of complete_deps is almost the same as your accumulated deps,
but there is no need to update it in the accept phase.

But if it's decided in the pre-accept phase, then it's not even close to being the
same; my accumulated deps are accumulated during the accept phase, so how can it be
the same if yours happens in the pre-accept phase?

Except for the difference in the phase in which they are formed, they play the same role
in determining the order.
They are both the just-big-enough subgraph of an SCC for determining the order.

Here it is modified: add X itself into the set, thus there won't be rank(A) == rank(B).

I'll take your word for it, but I don't understand how you could ever
guarantee that, especially in very complex loops. You even show a table with
values for rank(A), rank(B) and rank(C), but A, B and C could have
the same instance id (it's assigned at propose time and local to each
replica), so wouldn't they be all equal to each other?

Yes, my statement was not very strict there.
Thanks for pointing it out.
Adding X into the set only eliminates part of the problem.
To distinguish every instance, we still need to sort by (rank(X), instance_id(X)), or something similar.

First, at t=1, A received 2 pre-accept replies, R1 could commit on fast-path.

I omitted fast path, because it's irrelevant to the issue (it's basically a
cheat where you got lucky and something happened the same way on all replicas,
being lucky is good for performance, but algorithm must be safe when fast path
doesn't exist). Working around fast path is always possible, but makes graph
messy, it's easier to consider two instances instead of 3, 4, 5, etc.

complete_deps is collected in the pre-accept phase.
Both the slow path and the fast path go through the pre-accept phase, thus I think it won't
be a problem.

If the entire SCC is committed

The problem I'm trying to solve is how to start executing the graph when
entire SCC is not committed, and may never be committed, because it never
stops growing. Think what happens when conflict rate is close to 100%,
commands proposed on multiple replicas will almost always be looped to each
other with more and more dependencies that are not committed yet.

I said If the entire SCC is committed to simplify the core of this idea.
It is not a requirement for the algorithm to work.
complete_deps contains enough information to find the first instance to
execute.

I assume you got the gist of this from your next comments, thus I'm not going too
deep into the explanation :). I'm gonna see what's in your next next comment :DDD.

@drmingdrmer
Author

Ah, I think I'm starting to understand. What you propose is to accumulate
transitive conflicts of each conflicting instance you encounter during
pre-accept, this way you can glimpse what could have been chosen during a
commit. After conflicting instances A and B are committed, if A
conflicts are a subset of what pre-accept of B seen for A, then B has a
non-reversible dependency on A. Otherwise there is potential for overlap,
making dependency effectively bidirectional, with direction decided based on
rank.

Yes, I had a little trouble finding the right words for the transitive conflicts.
:(((

That's an interesting idea, but I'm not sure it works in some edge cases,
where dependencies don't match exactly because of different quorums.

| 1 2 3 |
+-------+
|   C D | C and D are proposed
| A     | A is proposed, initially A.complete_deps=[]
| c     | PreAccept(C) reaches replica 1, reply PreAcceptOK(C, complete_deps=[A])
|   a   | PreAccept(A) reaches replica 2, reply PreAcceptOK(A, complete_deps=[C])
|   C   | PreAcceptOK(C, complete_deps=[A]) received, broadcast Accept(C, complete_deps=[A])
| C C   | Accept and Commit(C, complete_deps=[A])
|     a | PreAccept(A) reaches replica 3, reply PreAcceptOK(A, complete_deps=[D])
|     B | B is proposed, initially B.complete_deps=[A, D]
| b     | PreAccept(B) reaches replica 1, reply PreAcceptEQ(B, complete_deps=[A, D])
|     B | Fast Commit(B, complete_deps=[A, D])
| A     | PreAcceptOK(A, complete_deps=[C]) received, broadcast Accept(A, complete_deps=[C])
|     A | Accept(A, complete_deps=[C]) received, reply AcceptOK(A, complete_deps=[C])
| A A   | AcceptOK(A, complete_deps=[C]) received, Commit(A, complete_deps=[C])
|   D D | Commit(D, complete_deps=[A, B, C])

Final graph will have the following dependencies and ranks:

A -> C
C -> A
B -> A, D
D -> A, B, C

A <-> C
^  \  ^
|   \ |
B <-> D

Notice how there are two loops A <-> C and B <-> D, with dependency from
one loop to another. When replica 2 receives commits A and C they will be
executed (in some order), because it cannot see B and D yet. As far as I
can understand your rules B is not ordered after A because C is not a
subset of A, D, so the order is determined by rank. When replica 3 receives
all commits it may choose different order, e.g. let's say instance-ids are:

  • A = 1000
  • C = 1000
  • D = 100
  • B = 101

Then we have:

  • rank(A) = A+C = 2000
  • rank(C) = C+A = 2000
  • rank(B) = A+B = 1101
  • rank(D) = A+C+D = 2100

We would see B having the lowest rank, what's stopping us from executing it
before A and C? (even though dependency-wise B clearly goes after A
and C)

Unfortunately contradictions are still possible. Looking at subset/superset is
a nice idea, but it cannot guarantee edges will be present on both ends of
potentially intersecting commands.

We allow B to be executed before A and C.

First, I concluded that the interfering relations are as below (you did not specify
them, but they are obvious :DDD):

A -- ~ -- C
| \       |
~  ` ~ .  ~
|       \ |
B -- ~ -- D

This situation is a little bit complicated because it depends on what
fast-commit condition the system uses.

As epaxos specifies, there are 3 constraints for fast-commit, and a system
must choose one of them (to make recovery work):

  • 1: The leader sends the pre-accept-request to only a fast-quorum, no other replicas, and
    receives identical replies.

  • The leader sends the pre-accept-request to an arbitrary number of replicas, but:

    • 2: the updated deps (or complete_deps) must be the same as their initial value, and the identical replies must constitute a fast-quorum.

    • 3: the updated deps (or complete_deps) must all be committed, and the replies must constitute a fast-quorum.

In this scenario, it could be the second or the third constraint that the system
has chosen (because A sends pre-accept to 2 other replicas).

And B is not initiated after A became safe (B saw A on R1, and the value of A
on R1 is not the final value of A).
Thus any consistent order for A and B is reasonable.

Thus in this situation executing B before A is reasonable.

@snaury

snaury commented Mar 15, 2020

depends on what fast-commit condition the system uses

In my example only B is using fast commit, but it's just a minor artifact. Fast commit complicates recovery, but otherwise it's perfectly fine to send pre-accept to all replicas and wait for a single reply. For a 3 replica system fast quorum is 2 replicas.

And B is not initiated after A became safe (B saw A on R1, and the value of A on R1 is not the final value of A).
Thus any consistent order for A and B is reasonable.
Thus in this situation executing B before A is reasonable.

I'm not saying it's unreasonable, it's perfectly reasonable. I'm saying the execution order is inconsistent:

  • Replica 2: A, C, B, D
  • Replica 3: B, A, C, D

Don't you think that's an issue? My example is complicated, but it illustrates how it's possible to have A in B.complete_deps (but not vice versa), yet rank(B) < rank(A) (so B is executed first). Think of it this way: your rank(X) rule flips some dependencies in the graph, but then some instances don't see what they must depend on.

I think for it to be safe you must show either:

  • If A is in B.complete_deps, but B is not in A.complete_deps, then A always executes before B, that contradicts any example where there's a one-way loop of 3 or more instances, so it's necessary to flip at least one dependency in this case (i.e. you need to handle SCCs and wait for all transitive dependencies to be committed first). That's what original epaxos does.
  • If (same-as-above), and B executes before A (the order flips due to some external condition, like rank), then you must guarantee B is always (somehow) discovered before or together with A, otherwise A not having a dependency on B may allow some replica to execute A before B, making execution order inconsistent. That's what you seem to be trying to do, but I don't see how you can guarantee discoverability.
  • If the order between A and B is in any way uncertain until both are committed, then both A and B must have each other in complete_deps. That's what I'm doing (the order between connected instances is always based on seq, i.e. the graph is undirected). I pay the price with some dependencies being unstable/spurious, but I show that unstable/spurious dependencies are always in the direction of lower seq, so the order is always consistent.

I would love for the second option to work somehow (it would then be possible to use with bpaxos, where dependency service is separate from command leaders), but I don't see how it can work in a single pass like you propose. :(

@drmingdrmer
Author

drmingdrmer commented Mar 16, 2020

2020-05-21: The following solution is not proved correct. Refer to #19 for a proved exec-algo without livelock.

I do not yet have an answer to your previous comment.
But I have another idea that seems clearer :|

Break loop

It is easy to see that every loop contains at least one instance that entered the accept phase (slow path).
This instance is the point at which to break the loop:

If X depends on Y but Y.complete_deps ⊄ X.complete_deps, it means Y changed after X saw it.
In this case, X does not need to execute after Y, since the condition of execution linearizability (X executes after Y if X is proposed after Y is committed) does NOT hold. Removing this relation does not break the linearizability guarantee.

Thus we remove the relation X->Y from the graph.
After removing all these unnecessary relations, every loop can be broken. The resulting graph has no loops (but there are some orphan vertices).
We call the resulting graph G'.
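
As a sketch of that edge removal (illustration only; here deps holds an instance's direct dependencies and complete_deps its transitive ones, both as sets of instance ids, which are assumptions of the sketch):

```python
def build_g_prime(instances):
    """Drop every edge X -> Y where Y.complete_deps ⊄ X.complete_deps,
    i.e. Y changed after X saw it, so the X-after-Y constraint is not
    required (per the argument above)."""
    g_prime = {}
    for x in instances.values():
        kept = set()
        for y_id in x.deps:                                  # direct dependencies
            if instances[y_id].complete_deps <= x.complete_deps:
                kept.add(y_id)                               # Y unchanged after X saw it
        g_prime[x.id] = kept
    return g_prime
```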

Execution

1. Choose the instance X with the minimal id and wait for all of its complete_deps to be committed.

2. Following the path in G', go to the minimal sibling instance and repeat from step 1.
When there are no more outgoing edges, execute the instance.

Consistency:

This is not correct. I have a fix here: #14 (comment)

If two replicas have chosen different min-id instances:

  • If there is a path between them in G', then one of the replicas will find the other instance along the path.
  • If there is no path between them, the two instances do not need an order constraint.

What do you think, guys?

@drmingdrmer
Author

BTW, what is bpaxos? I did not find a paper on Google. Could you share a link?

@snaury

snaury commented Mar 16, 2020

BTW, what is bpaxos? I did not find a paper on Google. Could you share a link?

It's kind of off topic, but here's where I initially read about it: https://arxiv.org/abs/2003.00331

@drmingdrmer
Author

drmingdrmer commented Mar 17, 2020

2020-05-21: The following solution is not proved correct. Refer to #19 for a proved exec-algo without livelock.

Fix: in the previous comment I said

every loop contains at least one instance that entered the accept phase (slow path).

That is not correct. X having seen an old version of Y does not mean Y entered the accept
phase. But the condition to check it, Y.complete_deps ⊄ X.complete_deps, is
still correct.

This should fix the consistency problem in my previous comment.

Add order to tip vertices

For the vertices in G' that have no outgoing edges, it is safe to
assign some total order to them so that no loop is formed.

We could just use the instance-id to sort these instances.

Execution

Executing one instance is quite simple (a sketch follows after this list):

  • Select one instance.

  • Follow after edges to walk to a tip vertex (T), which has no more outgoing
    edges. E.g. in each step, choose the sibling with the minimal id.

    An after edge means that X depends on Y and Y was not changed
    after X had seen it, as I previously explained.

  • For every instance X in T.deps (not complete_deps) without outgoing edges, there is an
    order between X and T:

    execute T or some X according to their instance-id order.
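
Something like this, reusing the build_g_prime sketch from my previous comment (again just an illustration; it assumes the instances involved are already committed and that g_prime maps an instance id to its kept after-edges):

```python
def pick_tip_to_execute(start_id, g_prime, instances):
    """Walk after-edges to a tip vertex, then order tips by instance id."""
    t = start_id
    seen = {t}
    while g_prime[t]:
        t = min(g_prime[t])          # at each step, take the minimal-id sibling
        if t in seen:                # defensive only; G' is claimed to be loop-free
            break
        seen.add(t)
    # among T and the instances in T.deps that are themselves tips,
    # the one with the smallest instance id is executed first
    tips = [d for d in instances[t].deps if not g_prime.get(d)]
    return min([t] + tips)
```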

Consistency

If two replicas (R1, R2) have chosen different tip instances X1, X2 to execute:

If X1 ~ X2, then they have a total order, so eventually they will choose the same
one.

Otherwise, executing either of X1, X2 is OK.

@drmingdrmer
Author

drmingdrmer commented Mar 17, 2020

2020-05-21: The following solution is not proved correct. Refer to #19 for a proved exec-algo without livelock.

depends on what fast-commit condition the system uses

In my example only B is using fast commit, but it's just a minor artifact.
Fast commit complicates recovery, but otherwise it's perfectly fine to send
pre-accept to all replicas and wait for a single reply. For a 3 replica system
fast quorum is 2 replicas.

Not quite. If we do not use fast-commit constraint 1 (send pre-accept to
exactly a fast-quorum), we must choose one of the other two.
With a 3-replica setup, sending 2 pre-accept-requests confuses the recovery:

A ~ B ~ C; C ~ A

1 2 3
-----
A B C | init A, B, C
  a a | A sends pre-accept to B, C; R2 has A->B, R3 has A->C
A     | A receives a pre-accept-reply from B and fast-commits with A->B.
  p p | The recovery process sends prepare to B, C, then it cannot tell which value
      | would have been committed, A->B or A->C.
      | Thus one of the 3 constraints must be introduced here to solve this.

This is my understanding of epaxos, but it is a little off-topic. :|

I'm not saying it's unreasonable, it's perfectly reasonable. I'm saying the
execution order is inconsistent:

  • Replica 2: A, C, B, D
  • Replica 3: B, A, C, D

I have to say you must be my hero. :DDD

  • If A is in B.complete_deps, but B is not in A.complete_deps, then
    A always executes before B, that contradicts any example where there's a
    one-way loop of 3 or more instances, so it's necessary to flip at least one
    dependency in this case (i.e. you need to handle SCCs and wait for all
    transitive dependencies to be committed first). That's what original epaxos
    does.

This seems to be the simplest direction for solving the problem.
I followed this way and got the idea in my previous comment:
#14 (comment)

I think it works, but as always, I am waiting for your opinions. :DDD

  • If (same-as-above), and B executes before A (the order flips due to some
    external condition, like rank), then you must guarantee B is always
    (somehow) discovered before or together with A, otherwise A not having a
    dependency on B may allow some replica to execute A before B, making
    execution order inconsistent. That's what you seem to be trying to do, but I
    don't see how you can guarantee discoverability.

Discovering an instance that no present instance depends on does not seem to be
a task that can be done in finite time.
A trick could be to periodically propose a Noop instance. Thus empty instance
slots will be filled and we could discover, within a bounded duration, whether such a
B exists.
But it is ugly because it tries to fix an algorithm-level problem at the engineering level.
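
For illustration only, the trick would look roughly like this (propose is a placeholder for however the replica starts a new instance; none of this is a real API):

```python
import threading
import time

def start_noop_ticker(propose, interval_sec=0.1):
    """Engineering-level workaround sketch: periodically propose a no-op
    command so empty instance slots get filled and a hidden instance like B
    is discovered within a bounded time."""
    def loop():
        while True:
            propose("NOOP")          # placeholder no-op command
            time.sleep(interval_sec)
    threading.Thread(target=loop, daemon=True).start()
```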
