Show a diff of all files modified in the current branch against the remote's default branch (origin/HEAD):
alias brdiff="def_branch=\$(git symbolic-ref refs/remotes/origin/HEAD | sed 's@^refs/remotes/origin/@@'); git diff origin/\$def_branch..."
Same, but ignoring whitespace differences by adding --ignore-space-at-eol or -w:
alias brdiff-nows="def_branch=\$(git symbolic-ref refs/remotes/origin/HEAD | sed 's@^refs/remotes/origin/@@'); git diff -w origin/\$def_branch..."
List all the files that were added or modified in the current branch compared to the remote's default branch (origin/HEAD):
alias brfiles="def_branch=\$(git symbolic-ref refs/remotes/origin/HEAD | sed 's@^refs/remotes/origin/@@'); git diff --name-only origin/\$def_branch..."
Once we have the list, we can automatically open an editor and load just the added and modified files:
alias bremacs="def_branch=\$(git symbolic-ref refs/remotes/origin/HEAD | sed 's@^refs/remotes/origin/@@'); emacs \$(git diff --name-only origin/\$def_branch...) &"
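The aliases above all repeat the same default-branch lookup. A minimal sketch of factoring it into a shell function, assuming a POSIX-compatible shell and that `refs/remotes/origin/HEAD` is set (it normally is after `git clone`). The names `default_branch` and `brfiles_fn` are hypothetical:

```shell
# Print the name of the remote's default branch, e.g. "main" or "master".
# Assumes refs/remotes/origin/HEAD exists (normally set by `git clone`).
default_branch() {
    git symbolic-ref refs/remotes/origin/HEAD | sed 's@^refs/remotes/origin/@@'
}

# Same behavior as the brfiles alias, built on top of the helper.
brfiles_fn() {
    git diff --name-only "origin/$(default_branch)..."
}
```

Functions avoid the `\$` escaping that the aliases need, which makes them easier to extend.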
(note to self: this is a sync from the-art-of-debugging/methodology.md
which is the true source)
The approach discussed next should work for any revision control system that supports bisecting. We will use git bisect
in this discussion.
git bisect
helps to quickly find the commit that caused a certain problem.
Use case: Say, you were using transformers==4.33.0
and then you needed a more recent feature so you upgraded to the bleeding-edge transformers@main
and your code broke. There could have been hundreds of commits between the two versions and it'd be very difficult to find the commit that led to the breakage by going through all of them. Here is how you can quickly find out which commit was the cause.
footnote: HuggingFace Transformers is actually pretty good at not breaking often, but given its complexity and enormous size it happens nevertheless and the problems are fixed very quickly once reported. Since it's a very popular Machine Learning library it makes for a good debugging use case.
Solution: Bisecting all the commits between the known good and bad commits to find the one commit that's to blame.
We are going to use 2 shell terminals: A and B. Terminal A will be used for git bisect
and terminal B for testing your software. There is no technical reason why you couldn't get away with a single terminal but it's easier with 2.
- In terminal A fetch the git repo and install it in devel mode (
pip install -e .
) into your Python environment.
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e .
Now the code of this clone will be used automatically when you run your application, instead of the version you previously installed from PyPI or conda or elsewhere.
Also for simplicity we assume that all the dependencies have already been installed.
- Next we launch the bisecting. In terminal A, run:
git bisect start
- Discover the last known good and the first known bad commits
git bisect
needs just 2 data points to do its work. It needs to know one earlier commit that is known to work (good
) and one later commit that is known to break (bad
). So if you look at the sequence of commits on a given branch it'd have 2 known points and many commits around these that are of an unknown quality:
...... orig_good ..... .... .... .... ..... orig_bad ....
------------->---------------->----------------> time
So for example if you know that transformers==4.33.0
was good and transformers@main
(HEAD
) is bad, find which commit corresponds to the tag 4.33.0
by visiting the releases page and searching for 4.33.0
. We find that it was the commit with SHA 5a4f340d
.
footnote: typically the first 8 hex characters are enough to have a unique identifier for a given repo, but you can use the full 40 character string.
So now we specify which is the first known good commit:
git bisect good 5a4f340d
and as we said we will use HEAD
(latest commit) as the bad one, in which case we can use HEAD
instead of finding out the corresponding SHA string:
git bisect bad HEAD
If however you know it broke in 4.34.0
you can find its latest commit as explained above and use that instead of HEAD
.
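The SHA for a tag can also be looked up locally rather than on the releases page. A small sketch, assuming the tag exists in your clone. The helper name `tag_sha` is hypothetical, and the tag name in the usage comment (`v4.33.0`) is an assumption about how the project names its tags:

```shell
# Print the abbreviated SHA (at least 8 hex characters) of the commit a tag points to.
# `^{commit}` dereferences an annotated tag to the underlying commit.
tag_sha() {
    git rev-parse --short=8 "$1^{commit}"
}

# Usage, inside the clone:
#   git tag --list 'v4.33*'               # check how the tag is actually named
#   git bisect good "$(tag_sha v4.33.0)"
```

Note that git bisect good and git bisect bad accept any revision, so a tag name can be passed directly as well.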
We are now all set to find the commit that broke things for you.
And after you told git bisect
the good and the bad commits it has already switched to a commit somewhere in the middle:
...... orig_good ..... .... current .... .... ..... orig_bad ........
------------->--------------->---------------->----------------> time
You can run git log
to see which commit it has switched to.
As a reminder, we installed this repo with pip install -e .
so the Python environment is instantly updated to the current commit's code version.
- Good or bad
The next stage is telling git bisect
if the current commit is good
or bad
:
To do so in terminal B run your program once.
Then in terminal A run:
git bisect bad
if it fails, or:
git bisect good
if it succeeds.
If, for example, the result was bad, git bisect
will internally flag the last commit as the new bad and will halve the commits again, switching to a new current commit:
...... orig_good ..... current .... new_bad .... ..... orig_bad ....
------------->--------------->---------------->----------------> time
And, vice versa, if the result was good, then you will have:
...... orig_good ..... .... new_good .... current ..... orig_bad ....
------------->--------------->---------------->----------------> time
- Repeat until no more commits left
Keep repeating step 4 until the problematic commit is found.
Once you have finished bisecting, git bisect
will tell you which commit was responsible for breaking things.
...... orig_good ..... .... last_good first_bad .... .. orig_bad ....
------------->--------------->---------------->----------------> time
If you followed the little commit diagrams, it'd correspond to the first_bad
commit.
You can then go to https://github.com/huggingface/transformers/commit/
and append the commit SHA to that URL, which will take you to the commit (e.g. https://github.com/huggingface/transformers/commit/57f44dc4288a3521bd700405ad41e90a4687abc0
), which will then link to the PR from which it originated. You can then ask for help by following up in that PR.
If your program doesn't take too long to run, then even with thousands of commits to search you are facing only n
bisecting steps for 2**n
commits, so 1024 commits can be searched in 10 steps.
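To make the arithmetic concrete, here is a small sketch that computes the smallest n with 2**n at least the number of commits. The function name `bisect_steps` is hypothetical, and the real number of steps git bisect takes can differ slightly depending on the shape of the history:

```shell
# Print the smallest n such that 2**n >= the given number of commits.
bisect_steps() {
    commits=$1
    steps=0
    span=1
    while [ "$span" -lt "$commits" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 1024   # prints 10
bisect_steps 5000   # prints 13
```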
If your program is very slow, try to reduce it to something small - ideally a small reproduction program that shows the problem really fast. Often, commenting out huge chunks of code that you deem irrelevant to the problem at hand can be all it takes.
If you want to see the progress, you can ask it to show the current range of remaining commits to check with:
git bisect visualize --oneline
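When the good/bad check can be expressed as a command, the loop of steps 4 and 5 can also be handed over to git bisect run, which marks each commit based on the command's exit code. The following is a toy sketch in a throwaway repo, not the transformers setup from the text: the function name `bisect_run_demo`, the file names and the "pass"/"fail" marker are all made up for illustration.

```shell
# Toy demo of `git bisect run` in a temporary repo.
# Commits 1-10 write "pass" to status.txt, commits 11-16 write "fail".
bisect_run_demo() (
    repo=$(mktemp -d) && cd "$repo" || exit 1
    git init -q
    git config user.name "bisect-demo"
    git config user.email "bisect-demo@example.com"

    for i in $(seq 1 16); do
        if [ "$i" -lt 11 ]; then echo pass > status.txt; else echo fail > status.txt; fi
        echo "$i" > counter.txt
        git add -A
        git commit -q -m "commit $i"
    done

    # bad = HEAD (commit 16), good = HEAD~15 (commit 1)
    git bisect start HEAD HEAD~15 >/dev/null 2>&1
    # The command must exit 0 for a good commit and 1-127 (except 125) for a bad one.
    git bisect run grep -q pass status.txt >/dev/null 2>&1

    # refs/bisect/bad now points at the first bad commit; print its subject line.
    git log -1 --format=%s refs/bisect/bad
    git bisect reset >/dev/null 2>&1
)

bisect_run_demo   # prints "commit 11"
```

In a real project the `grep` command would be replaced by your own reproduction script.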
- Clean up
So now restore the git repo clone to the same state you started from (most likely HEAD) with:
git bisect reset
and possibly reinstall the good version of the library while you report the issue to the maintainers.
Sometimes, the issue emerges from intentional backward compatibility breaking API changes, and you might just need to read the project's documentation to see what has changed. For example, if you switched from transformers==2.0.0
to transformers==3.0.0
it's almost guaranteed that your code will break, as major version number changes are typically used to introduce major API changes.
- Possible problems and their solutions:
a. skipping
If for some reason the current commit cannot be tested - it can be skipped with:
git bisect skip
and git bisect
will continue bisecting the remaining commits.
This is often helpful if some API has changed in the middle of the commit range and your program starts to fail for a totally different reason.
You might also try to make a variation of the program that adapts to the new API, and use it instead, but it's not always easy to do.
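If the check is automated with git bisect run, a script can request the same skip by exiting with code 125. A minimal sketch of that convention, where `bisect_check` and its two command arguments are hypothetical placeholders for your own setup and test commands:

```shell
# Exit-code convention understood by `git bisect run`:
#   0 = good, 125 = cannot be tested (skip), any other code in 1-127 = bad.
bisect_check() {
    setup_cmd=$1
    test_cmd=$2
    # If the commit cannot even be set up (e.g. an unrelated API change), skip it.
    sh -c "$setup_cmd" || return 125
    # Otherwise the test result decides between good and bad.
    if sh -c "$test_cmd"; then return 0; else return 1; fi
}

# Usage sketch (the commands are placeholders):
#   bisect_check "python -c 'import transformers'" "python my_repro.py"
```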
b. reversing the order
Normally git expects bad
to be after good
.
...... orig_good ..... .... .... .... ..... orig_bad ....
------------->--------------->---------------->----------------> time
Now, if bad
happens before good
revision order-wise and you want to find the first revision that fixed a previously existing problem - you can reverse the definitions of good
and bad
- it'd be confusing to work with overloaded logic states, so it's recommended to use a new set of states instead - for example, fixed
and broken
- here is how you do that.
git bisect start --term-new=fixed --term-old=broken
git bisect fixed
git bisect broken 6c94774
and then use:
git bisect fixed / git bisect broken
instead of:
git bisect good / git bisect bad
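The reversed search can be tried out safely on a throwaway repo. This is a toy sketch under made-up assumptions (the function name `bisect_fixed_demo`, the file names and the "pass"/"fail" marker are hypothetical). It uses git bisect run, which treats exit code 0 as the old term (here broken) and a non-zero code as the new term (here fixed):

```shell
# Toy demo: find the first commit that FIXED a problem, using custom terms.
# Commits 1-5 write "fail" to status.txt, commits 6-16 write "pass".
bisect_fixed_demo() (
    repo=$(mktemp -d) && cd "$repo" || exit 1
    git init -q
    git config user.name "bisect-demo"
    git config user.email "bisect-demo@example.com"

    for i in $(seq 1 16); do
        if [ "$i" -lt 6 ]; then echo fail > status.txt; else echo pass > status.txt; fi
        echo "$i" > counter.txt
        git add -A
        git commit -q -m "commit $i"
    done

    git bisect start --term-new=fixed --term-old=broken >/dev/null 2>&1
    git bisect fixed HEAD >/dev/null 2>&1        # newest commit works
    git bisect broken HEAD~15 >/dev/null 2>&1    # oldest commit is broken

    # Exit 0 ("fail" found) marks the commit as broken, exit 1 marks it as fixed.
    git bisect run grep -q fail status.txt >/dev/null 2>&1

    # refs/bisect/fixed now points at the first fixed commit; print its subject line.
    git log -1 --format=%s refs/bisect/fixed
    git bisect reset >/dev/null 2>&1
)

bisect_fixed_demo   # prints "commit 6"
```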
c. complications
There are sometimes other complications, like when different revisions' dependencies aren't the same and for example one revision may require numpy=1.25
and the other numpy=1.26
. If the dependency package versions are backward compatible installing the newer version should do the trick. But that's not always the case. So sometimes one has to reinstall the right dependencies before re-testing the program.
Sometimes there is a range of commits that are actually broken in a different way. In that case you can either find a range of good...bad
commits that doesn't include the other bad range, or you can try to git bisect skip
the other bad commits as explained earlier.