- Notifications
You must be signed in to change notification settings - Fork842
Rebase is getting fast!#49
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
-
I made some performance improvements recently (particularly94e03f5,f084a05,c058ffe), which made rebasing reasonably fast in most cases. For example, rebasing 1272 commits in the git.git repo now takes 885 ms. That was a rebase that didn't rewrite any trees (I ran Changing the description of v1.8.5 in the git.git repo results in >30k commits getting rebased. That takes ~16 seconds. For a more direct comparison, I tried rewriting git.git's "todo" branch. Rebasing the 1351 linear commits in ee2122992937..todo there took 644 ms with Rebasing is still somewhat slow on merge commits when trees need to be merged. For example, rebasing the 505 descendants of v2.34.0 in the git.git repo takes almost 13 seconds. The vast majority of that time is spent on merge commits. The problem is that merge commits are rebased by first re-merging their parents and then calculating the difference compared to that. I plan to fix that by keeping the re-merged parent tree in some persistent cache. That cache will also be used for e.g. |
BetaWas this translation helpful?Give feedback.
All reactions
🎉 3🚀 2
Replies: 2 comments 3 replies
-
@Alphare: I know you told me to remember that speed is a feature when I presented this project to you a long time ago :) |
BetaWas this translation helpful?Give feedback.
All reactions
-
I'm glad to see speed is a first-class concern! The numbers sure sound nice. :) |
BetaWas this translation helpful?Give feedback.
All reactions
-
@newren: I know you've spent a lot of time on Git's "ort" strategy. Some replies to your merge-tree patches today made me think that you might be interested in the above. I assume Git can eventually get equally fast. I actually thought To be clear, I don't have support forrename detection, so I won't pretend that |
BetaWas this translation helpful?Give feedback.
All reactions
-
The rebase code in Git, found in sequencer.c + builtin/rebase.c, still has various vestiges of its transliteration from shell script. One of those is that the code assumes at every step that one has a full working tree and index. (Although, to be fair, the merge machinery it was using at the time it was written was designed such that worktree and index updates were intrinsic to the code and shotgun blasted everywhere, so assuming the merge would update the working tree wasn't a bad assumption back then.) Thus, it couldn't actually make use of the merge-ort API that provides just a toplevel tree object without significant rework; it needs there to be a full working tree and index at each step of the operation. So, for each commit being rebased, after it calls the merge-ort API for just computing the merge result and getting a new toplevel tree, it then immediately calls the API for updating the working tree and index. Also related to this, is the fact that it writes out all the control files to disk after every commit, rather than waiting for it to hit a conflict. You'd think it'd only need a few control files under .git/rebase-merge/ to allow resuming from hitting a conflict, but that's thinking in terms of what makes sense from a clean implementation. Rebase was originally a shell script, and parsing files in shell is painful. So every bit of control metadata useful to any particular set of rebase flags was usually a separate control file. And, of course, when switching from shell to C, we have this nice existing regression testsuite, and the best way to pass it is to maintain backward compatibility. So we're writing a lot of control files with every commit being picked too. But the insanity doesn't end there. Given that there were piles and piles git commands that had to be run for each commit being picked back when git-rebase was a shell script, on top of a slow merging backend in various cases, it made sense to add special case code to avoid work when possible. So, for example, rebase has some code to pre-emptively check if any of the commits being rebased happen to be a clean cherry-pick of something in the upstream branch newer than the commits being picked. That means pairwise comparing an awful lot of trees if history has diverged significantly. This probably saved time in the past, but has been measured and shown to be a significant performance overhead nowadays. Sadly, we can't just drop it, because it logically changes the results you get (trying to reapply a cherry pick that might be upstream already might just cleanly result in noticing the patch makes no changes and dropping it, but it might not cleanly reapply and thus throw conflicts at the user). Thus, we had to add a --reapply-cherry-picks option to rebase and mark it as important for getting good performance. Sadly, it was original identified as important for performancewithin partial clones (seegit/git@0fcb4f6), but subsequent measurements show it is just causing problems for rebases in general (see e.g.https://lore.kernel.org/git/CABPp-BE48=97k_3tnNqXPjSEfA163F8hoE+HY0Zvz1SWB2B8EA@mail.gmail.com/) There is also another revision walk done, I believe as an optimization ported from shell, in order to check if we can just fast-forward. But my measurements showed that while I'm sure it was a big perf win for the original shell script, it actually hurts performance in the C implementation when rebasing across a large portion of history. Even if we kept it, I think I noted at some point that it had multiple calls to merge-base computations, which seems excessive to me though I didn't look closely. The transliteration from shell still shows through in a few places too. For example, for each new commit it wants to create, why use some internal API when you can fork a full "git commit" process? One of the benefits of forking another process to create the commit, is that you then need to afterwards discard and re-read the index to make sure it's still consistent with what's on disk. "git checkout" for switching to the appropriate branch to start the rebasing process is also a forked subprocess for no real reason. There are also little things like running So, yeah, git rebase still haslots of room for improvement. The actual merges to generate the new trees are not even close to being the primary cost. It's on my radar (and that of a few other folks), but no promises about if or when these things will be cleaned up. |
BetaWas this translation helpful?Give feedback.
All reactions
-
Whoa, thanks for the extremely detailed reply!
Ah, of course, I should have thought of that. The rebase code is one of few areas of Git's codebase I've worked on. I've also spent months cleaning up Mercurial's rebase code. That has an "in-memory" (i.e. not-touching-working-copy) mode since a few years (not my work, though). However, it still falls back to using the working copy if there are conflicts (the whole rebase operation restarts when that happens). Much of the problems you describe sound familiar from Mercurial's transition. I had worried (for Git's sake) that preserving the behavior of hooks (such as the post-checkout hook) would be an issue, but I was told that the current behavior is not documented or guaranteed (IIUC), so that's nice. Interesting about the
Heh. I've heard about Git's frequent subprocessing in our weekly "Git chalk talks" at work.
And I assume you eventually shouldn't have to switch branch until possibly right at the end once everything is done "repo-first".
I completely understand. Thanks for your work so far! |
BetaWas this translation helpful?Give feedback.