-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explaining diffs of Merge<T>
(structured conflicts part 1)
#4259
base: main
Are you sure you want to change the base?
Conversation
Merge<T>
, part one of #4251Merge<T>
(structured conflicts part 1)
assert_eq!(c(&[0, 1], &[2, 3, 1]).simplify(), c(&[0], &[2, 3])); | ||
assert_eq!(c(&[0, 1], &[2, 3, 2]).simplify(), c(&[0, 1], &[2, 3, 2])); | ||
assert_eq!(c(&[0, 1], &[2, 3, 3]).simplify(), c(&[0, 1], &[2, 3, 3])); | ||
assert_eq!(c(&[0, 1], &[2, 3, 4]).simplify(), c(&[0, 1], &[2, 3, 4])); | ||
assert_eq!( | ||
c(&[0, 1, 2], &[3, 4, 5, 0]).simplify(), | ||
c(&[1, 2], &[3, 5, 4]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit dee5dff changed the order to be what it is here. I'm actually thinking more and more that we should try to interpret conflicts as a base (the first positive term) and a series of diffs/patches. Copy tracing is one reason for that, because the copies are associated with a diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer. I'll think a bit on it. Technically, I can split the simplification function machinery into two similar function (in other words, leave the Merge simplification machinery alone); I'm not sure whether that's the best solution.
we should try to interpret conflicts as a base (the first positive term) and a series of diffs/patches
Firstly, I'd call the first positive term a "side" :)
I once convinced myself that our options are either:
- Allow simplification cancel any add of a conflict against any remove, and be completely flexible about the order the adds & removes come in. This destroys the identity of diffs in the conflict whenever a simplification happens.
- Be quite strict about the order and only allow cancelling out neighboring terms. This would allow each "diff" in the conflict to mostly preserve its identity. However, this would mean that if you have
X -> Y -> Z
and reorder it asX -> Z' -> Y'
,Y'
will usually be conflicted.
So, I've been trying to get all we can get from the first of these options.
I need to look back on my notes/thoughts and see whether I still think that the intermediate options aren't very viable. It's possible that I attached some weight to an idea I no longer believe, namely that a merge of X
and Y
should be considered the same as the merge of Y
and X
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- However, this would mean that if you have
X -> Y -> Z
and reorder it asX -> Z' -> Y'
,Y'
will usually be conflicted.
That will give you this:
Z'=X+(Z-Y)
Y'=Z'+(Y-X)=X+(Z-Y)+(Y-X)=X+(Z-X)=Z
Simplifying Y'
to Z
only requires chaining diffs and treating the first term as a diff from the missing state. Right? To clarify, I don't object to chaining diffs (making A->B
plus B->C
into A->C
). My problem is with converting e.g. A->B
plus C->D
into A->D
plus C->B
.
Btw, when diffing e.g. A+(B-C)
to D+(E-F)
, I would prefer to display the diffs D-A
, E-F
, and C-B
(reverse of B-C
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm not addressing the reordering issue, still need to think or chat about that)
Btw, when diffing e.g. A+(B-C) to D+(E-F), I would prefer to display the diffs D-A, E-F, and C-B (reverse of B-C).
This is how what I called DiffOfMerge
treats it, more or less (and I am very much considering reworking this PR to make it treat it that way, exactly). I was first intending to present that to the user more or less directly, but then decided it would be too confusing. So, I built all the DiffExplanationAtom
machinery on top of that.
For the question of whether it's better to show (C-B) or -(B-C), the main example I have in mind is as follows. Suppose we resolve a conflict A+(B-C) to D where my made-up conflict is:
>>>>>
The blue heron
===============
- The blue pig
+ The flying pig
<<<<<
and the resolution is "The flying heron". If we show the diff D - (A + (B-C))
as (D-A) + (C-B), as you suggested, it'd be:
>>>>>
===== Part 1 diff (D-A)
- The blue heron
+ The flying heron
===== Part 2 diff (C-B)
+ The blue pig
- The flying pig
<<<<<
I find it hard to look at this and see what is going on, even though this is perfectly correct from the perspective of conflict theory.1
If I use my "DiffExplanation" presentation, I'm hoping jj
will present this diff as follows:
>>>>>
===== Part 1: Change conflict side
- The blue heron
+ The flying heron
===== Part 2: Removed conflict diff (B-C)
-| - The blue pig
-| + The flying pig
<<<<<
Here, there is a way to immediately see that the conflict was resolved "correctly": I can quickly see that the diff in part 1 means the same thing as the removed diff in part 2. This would be relatively easy to teach to users.
I think there were a couple more examples I had in mind, but I need to remember them and write them down (I am kicking myself a little for not having them written down already). For example, strictly speaking, we could omit any "UnchangedSide"-s from the presentation of the conflict diff, since they are not different, but I believe that could get quite confusing.
Footnotes
-
Less importantly, I think users who are less familiar with conflict theory will also be confused to see
+ The blue pig
where this was not present anywhere else these conflicts were presented, and because this promotes a conflict base (that people tend to ignore) to a place of great visibility. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to wrap up a reply,
Simplifying Y' to Z only requires chaining diffs and treating the first term as a diff from the missing state. Right?
It think you are correct; it seems like my example wasn't actually an example. Z'
being equivalent to Z
seems to rely only on associativity and cancelling out of neighboring terms, no reordering needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think you meant ... = B + (A - C) here (as you also say further down).
Yes, thanks, fixed.
And then the same chaining of patches applies,
(This part of the comment is probably less relevant, unless we want to consider quite weird theories as in my footnote above)
I think this chaining is different. If we don't allow commuting, we could say that (A - B) + (B - C) = (A-C)
but that (A - C) + (B - A)
might differ from (B - C)
. So, we don't have to say that A + (B - A) = A
.
However, as long as we want jj restore --from @- --to X; jj rebase -r @ -d X
to never create a conflict, we probably want to say that A + (B-A)
and B
are equivalent. This has some implications, at the very least that A + (B - C) = C + (B - A)
(that is, 2-sided conflicts commute), as I explained above.
Update/attempt at clarification that could be itself confusing: As I said above and say again below below, I'm not sure whether there are conflict theories where (A - C) + (B - A)
differs from (B - C)
, but A + (B - A) = B
(and therefore A + (B - C) = B + (A -C)
). If such a theory can be imagined, it'd be a weird one IMO.
OTOH, let's say we assume that (A - C) + (B - A) = (B - C)
as you did in your comment (and which seems like a reasonable thing to want). As you pointed out, this definitely means that A + (B - A) = B
, which means that A + (B - C) = B + (A - C)
as I explained in my previous comment. In other words, your assumption is stronger than mine was (to be precise, it's at least as strong as far as I know at this moment).
I think that this stronger assumption also implies that we can also swap top/bottom terms around at will. For example, (A - C) + (B - D) = (B - C) + (A - D)
because:
(A - C) + (B - D) = (A - C) + ( (B - A) + (A - B) ) + (B - D) =
= ( (A - C) + (B - A) ) + ( (A - B) + (B - D) ) =[simplify 2nd summand]=
= ( (A - C) + (B - A) ) + ( A - D ) =[by assumption]=
= (B - C) + (A - D)
I'm pretty sure the same can be done for the bottom terms, and once you can swap two terms on the top, you can reorder at will.
So, if there's a conflict theory that doesn't allow arbitrary reordering, it would have to consider (A - C) + (B - A)
and (B - C)
different, at least in some cases (or not be associative, or not allow the more natural simplifications I used above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That all makes sense. Thanks for writing it down. I don't think it's theoretically incorrect to swap bases and sides between diffs. I just think we should avoid it when we can so we don't present the user with a diff that they have never created themselves. OTOH, we do already do that in conflicts markers, but maybe it's more natural there because we show either the diff from the rebased commit or the diff from upstream (in the common case). Maybe we should change the code to only consider doing that swapping with the "upstream" (first positive term in the conflict - that might actually be a decent name for the first term, btw).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Update from Ilya: I moved this comment to #4259 (comment), since I just put a long post on a different topic in this thread.]
FWIW, I guess it would be easier to follow if conflicts aren't reordered per hunk.
For example, from: A, to: B+(D-C)
could be rendered as follows:
# line numbers
# contents
A B: C D: .. context line ..
A B: : .. hunk of [A, B] .. # conflict
: C D: .. hunk of [C, D] ..
A B: C D: .. context line ..
A B: : .. hunk of [A, B] .. # non-conflicting change
...
A B: C D: .. context line ..
A B: : .. hunk of [A, B] .. # another conflict (in the same order)
: C D: .. hunk of [C, D] ..
A B: C D: .. context line ..
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I think there was a subtle but important mistake in my logic.
TLDR (written after the fact)
Fortunately (amusingly?), my mistakes cancel out, in a way 😅, with similar problems with one of Martin's arguments above. TLDR, what I now again believe is what I said without justification before:
If we are strict about the ordering of diffs in a conflict, then if you have commits
X -> Y -> Z
and reorder it asX -> Z' -> Y'
,Y'
will usually be conflicted. In other words, if we requireY'=Z
in this example, it follows that conflicts with the same set of adds/removes (in any order) must be equivalent.
What still seems true
The part that continues to hold is that if you assume that
(A - B) + (B - C) = A - C = (B - C) + (A - B) (*)
then you can reorder the adds or removes arbitrarily. I repeat that proof in a better notation below.
In particular, if we don't want to allow arbitrary reordering, we have to forbid reordering patches in a conflict (which would imply (*)
).
Mistakes
There are two things that I now believe were mistaken, for similar reasons:
- My argument that
A + (B - A) = B
by itself implies thatA + (B - C) = B + (A-C)
. I must have mistakenly used both equalities from(*)
in my "proof", and got away with it because the notation I used was subtly confusing when "+" is not commutative. - Martin's argument from Explaining diffs of
Merge<T>
(structured conflicts part 1) #4259 (comment), if you do not assume both sides of(*)
Notation problem
One way to describe the problem is to say that the notation A + (B - C)
is bad because it doesn't properly alternate signs. A + (-C + B)
or (B - C) + A
is better1. Both of these are a bit awkward to consistently type, so I suggest going back to notation I used a long time ago: [A, C -> B]
instead of A + (-C + B)
. Here, ,
is really a composition operation, perhaps A∘(C->B)
would be better, but it's harder to type. (For typing, the Unicode symbol is un-intuitively called the "ring operator", U+2218). Or should we settle on A*(C->B)
?
With this notation, the "obvious" simplification equality is [C -> B, B -> A] = [C -> A]
. Here, the diffs happen from left to right: turning C into B, and then turning B into A is the same as turning C into A. In the old notation ordered correctly, this would be (-C + B) + (-B + A) = (-C + A)
, which is actually the more awkward-looking right equality in (*)
. The add of the first diff cancels out with the remove of the second diff.
Mistake 1
With the better notation, both [A, A->B] = B
and [B, A->A] = B
look "obvious". The first one is the natural direction of simplification, while the second one is the separate axiom that A->A
is a noop. The first one is the equality that used to be written as A + (B-A) = A
, and which I previously thought was an extra assumption we needed to make.
I no longer think that these by themselves should imply that [A, B ->C] = [C, B->A]
or anything else about reordering of terms.
Mistake 2
Let's now go over Martin's argument with a better notation. As before, we start with the commits X -> Y -> Z
and reorder them via rebase to X -> Z' -> Y'
, where rebase of Z
from Y
onto X
is defined to result in the conflict [X, Y -> Z]
. So,
Z' = [X, Y-> Z]
Y' = [Z', X->Y] = [X, Y->Z, X->Y]
Without the "less natural" direction of (*)
and without reordering of terms, there is no way to simplify Y'
.
As I said before, and will restate below, if we allowed the simplification [Y->Z, X->Y]=[X->Z]
, it would follow that any reordering of top/bottom terms is possible.
Restating the "part that still holds" in the new notation
For completeness, let's restate the "part that still hold" from the beginning of the post in the better notation.
We always assume that [C -> A, A -> B] = [C -> B]
. The claim is that if we also assume the opposite "less natural" direction, [A -> B, C -> A] = [C -> B]
, then we can prove that [A -> B, C -> D] = [C -> B, A -> D]
. The proof goes as:
[A -> B, C -> D] = [A -> B, (C->A, A->C), C->D] =
= [(A -> B, C -> A), (A -> C, C -> D)] =[simplify right parentheses]=
= [(A -> B, C -> A), A -> D] =[less natural simplification we assumed]=
= [C -> B, A -> D]
As before,
- this implies that the "
adds" (Update: "removes") of a conflict can be reordered arbitrarily (though "adds"/"removes" are not good names in this non-commutative case). Update: We could show the same thing for "adds" by inserting(B -> D, D -> B)
after the first equality instead of(C->A, A->C)
. - the "less natural simplification" and therefore arbitrary reordering would follow immediately if we allowed
[A->B, C->D] = [C->D, A->B]
. - this proof uses the associativity of patch composition, which I represented using parentheses. This conclusion wouldn't hold if we compromised on that, probably, but that would be confusing in its own way.
Footnotes
-
These two correspond to different ways of thinking about which way patches compose, though! I'll describe the first ordering in detail below, but if we wanted the left equality of
(*)
to be the "natural" one (as I was mostly assuming in my arguments in previous posts), we'd have to use the second ordering or, equivalently, something like[B <- C; A]
. We'd think of diffs as happening right-to-left. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was chatting about this, and I think it's worthwhile to write down a simpler argument that's slightly less powerful and uses less precise notation but is shorter and easier to understand.
The claim is that if you allow reordering of diffs (B - C) + (D - E) = (D - E) + (B - C)
then it follows that (B - C) + (D - E) = (D - C) + (B - E)
. If you believe this argument, analogous arguments imply that you can reorder adds and removes of any conflict arbitrarily.
The argument is:
(B - C) + (D - E) = (B - C) + ( (D - B) + (B - D) ) + (D - E) =
=reordering_diffs= ((D - B) + (B - C)) + ((B - D) + (D - E))=
(D - C) + (B - E)
If you write this down in more precise notation, you'll notice that in this shorter argument I'm using the "backwards" cancellation rule. However, since we assumed that the order of diffs doesn't matter, that's fine.
I don't follow the idea of #4251, but does it make sense to diff from/to file values all at once, and post-process each hunk? I think the output would look similar to For example, Just an idea. I don't know how the rendering result would look. |
if self.values.len() == 1 { | ||
return self; | ||
} | ||
self.values = simplify_internal(self.values.to_vec()).into(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can use .into_vec()
to avoid allocation, but maybe we can make simplify_internal()
not consume the input vec instead.
Btw, also note that it makes sense to display diffs involving conflicts as just repeated diffs. That includes the
|
I'm not sure I'm following exactly, but what you are saying sounds close to what I'm doing. In the next commit of #4251 (outside this PR, called "lib In particular, I didn't understand what you mean by "diffing file values all at once", whether you favor that or whether you thought that's what I was doing wrongly (I don't believe I'm doing that currently). |
This does make sense, but it becomes very confusing to separate the diffs from Also, if the sequence of diffs applies cleanly, I think that implies that the conflict could be simplified to unconflicted text to begin with. |
BTW, I don't know if we prefer |
I think that's what I'm doing. If not, I didn't understand what you mean.
Maybe the ordering of commits is confusing, this PR actually mostly post-processes conflicts of hunks (after the file conflicts are turned into a sequence of hunk conflicts, for each hunk, though I'm no longer sure we mean the same thing by the word "hunk"). The distance is taken into account when processing these hunks, very similarly to how Aside: The reason for the confusing order is that I wanted to look over the commits that split files into conflicted hunks again (what I called "Part 2" in the outline in #4251; I just edited it to try to make it clearer what that is). I tried to de-duplicate the functionality between conflict diffs and conflict materialization, but as a result the former will be less optimized.
What's the difference? In my terminology, a "conflict" that this code deals with is essentially a Perhaps what you are suggesting that my conflict hunks can get pretty long, and that one of them could be split into several hunks. In my language, this would replace one conflicted hunk with a sequence of two (or more) conflicted hunks that could then be simplified and presented independently. Some of these pieces perhaps would even no longer be conflicted after that simplification. If so, I think this would be great, and in my mind the place for this is inside the algorithm that splits the original single-hunk conflict (the conflict where each side is as long as the entire file) into a sequence of conflicted hunks. Currently, that happens in later commits outside this PR (again, what I called "Part 2"). I do this very naively (it's really a placeholder algorithm for a proof of concept), but I do do it. I was hoping to postpone thinking about improving the algorithm (which could involve some non-trivial computer science, but there could also be some low-effort improvements) until later PRs.
Taking distance into account seems significant for |
Maybe I misunderstood that. Thanks for the clarification. I thought the added function would process So the difference is whether or not reorder (and classify?) hunks, I think.
I mean the former is // file -> hunk -> conflict -> diff_lines
for file in tree_diff {
show_file_header(file);
for hunk in Diff::by_line([from, to, from, to, ...]) {
match hunk {
Different([from, to, from, to, ...]) => {
for (from, to) in [...].chunks(2) {
show_color_words_hunk(from, to)
}
}
}
}
} and the latter is // file -> conflict -> hunk -> diff_lines
for file in tree_diff {
for (from, to) in [from, to, from, to, ...].chunks(2) {
show_file_header(file);
for hunk in Diff::by_line([from, to]) {
match hunk {
for (from, to) in [...] {
show_color_words_hunk(from, to)
}
}
}
}
} Implementation-wise, the latter is simple, but we'll probably want the former?
I have no idea about |
Yes, my example from #4259 (comment) was to illustrate that it's always correct (but not necessarily intuitive) to display diffs involving conflicts as repeated diffs. For
I think the algorithm in #4176 does not require measuring edit distances. It basically just keeps track of where each line came from by walking backwards through history until the line doesn't exist in the parents. So I think that handles the case where you resolve a conflict in a merge commit. There's also the case of calculating blame across linear commits where some commits have conflicts. I'm not sure what makes most sense to do there. Maybe we consider a line as coming from a parent with a conflict if it appears in any side (positive term) of the conflict. |
(Re reordering) I don't have a final conclusion, but I just figured out a theoretical limitation on forbidding reordering. It seems to forbid too much, see #4259 (comment). In addition, I'll mention more explicitly the main example that floats in my mind about reordering: Let's say we have a commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a thread for @yuja 's comment #4259 (comment), copied below, so that it doesn't get lost:
FWIW, I guess it would be easier to follow if conflicts aren't reordered per hunk.
For example, from: A, to: B+(D-C)
could be rendered as follows:
# line numbers
# contents
A B: C D: .. context line ..
A B: : .. hunk of [A, B] .. # conflict
: C D: .. hunk of [C, D] ..
A B: C D: .. context line ..
A B: : .. hunk of [A, B] .. # non-conflicting change
...
A B: C D: .. context line ..
A B: : .. hunk of [A, B] .. # another conflict (in the same order)
: C D: .. hunk of [C, D] ..
A B: C D: .. context line ..
...
96ca089
to
b360d62
Compare
c276599
to
eddc8ee
Compare
576b347
to
58d9498
Compare
877d6a0
to
9ce8a45
Compare
caf1d84
to
a545479
Compare
These will be used for diffs of Merge-s. This also refactors one of the less important tests. I did it mostly automatically with `ast-grep` (which I don't necessarily recommend), but we could also just delete this test I think.
This reorders the result slightly and makes it easier to generalize this to work on diffs of merges (where there is the same number of adds and removes).
This represents a diff of two merges, up to reordering and cancelling out of terms. It can also be thought of as a stack of `n` diffs. This type is not very suitable for presenting to the user (since it loses information they might want to see), but is very natural from the perspective of conflict theory.
…tion For textual conflicts, the distance function will be the size of the diff between two chunks of text.
…ture This turns two Merge-s into an object that represents a way to present their difference to the user, in the form of `Vec<DiffExplanationAtom>`. `DiffExplanationAtom` is the key structure for the data model of differences between diffs. The algorithm is inefficient and far from perfect, for now my focus is on the data model.
This part of #4251 is more cooked than the rest, and contains the fundamental data model. There are still some TODOs for some parts of #4251, but I'd appreciate other people's input for which of them are important to fix immediately and what approach people prefer.
See #4251 for more discussion on the design.
This PR is that it does not add any new features, only new plumbing.
Checklist
If applicable:
CHANGELOG.md