Understanding Git Garbage Collection

Garbage collection is a well-known software practice. A garbage collector, or just collector, attempts to reclaim garbage: memory occupied by objects that the program no longer uses. When we talk about Git garbage collection, we mean almost the same thing. The Git garbage collector runs a number of housekeeping tasks within the current repository, such as compressing file revisions (to reduce disk space and increase performance), removing unreachable objects that may have been created by prior invocations of git add, packing refs, and pruning the reflog and stale working trees.

It is generally recommended to run this task on a regular basis within each repository to maintain good disk space utilization and good operating performance. This is especially important for a repository in active usage by large teams.

Why is it necessary at all?

Imagine this scenario: you write some code and check it in. Then you realize that it is not working. You debug your code and find a syntax error, so you make changes and check them in. Then you realize that you have not checked for certain conditions. For whatever reason, your work is not complete when it should have been.

When I was using Subversion, the only thing that could be done about this was to add a new commit. Even with the usual day-to-day fixes in Git, I more often than not wind up with three or four commits in a row. The first one usually says “Did thing X”, and the next several either repeat that message for lack of better words or say things like “Fixed typo with this” or “Added the correct variables”. You get the idea. Sometimes I use git rebase or git commit --amend to hide my silly typos and spare myself the embarrassment during pull request reviews.

Some Interesting Facts

As we all learnt at some point, however long ago, a commit’s ID is a SHA-1 hash of several pieces of information: mainly the contents of the commit and the IDs of its parent commits, among a few other things. This essentially means that when you use git commit --amend, you’re actually building a completely different commit and pointing your local branch reference to it instead. The first commit is still there on disk, and you can still get back to it. The same goes for git rebase. However, in the interest of not cluttering up your view, neither git log nor your Git visualizer will show it to you, because it’s not part of the history of anything you should care about.
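
To see this for yourself, here is a minimal sketch that amends a commit in a throwaway repository and then checks that the original object is still on disk. The temporary directory, the file name notes.txt and the identity values are placeholders I made up for the demo, not anything your repository needs.

```shell
#!/bin/sh
set -e

# Throwaway repository; every name here is a placeholder for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Did thing X"
old=$(git rev-parse HEAD)

# Amend with different content: Git builds a brand-new commit object.
echo "fixed draft" > notes.txt
git add notes.txt
git commit -q --amend -m "Did thing X"
new=$(git rev-parse HEAD)

echo "before amend: $old"
echo "after amend:  $new"

# git log only knows about the new commit...
visible=$(git rev-list --count HEAD)
echo "commits visible in history: $visible"

# ...but the old object can still be looked up by its ID.
old_type=$(git cat-file -t "$old")
echo "old object is still a: $old_type"
```

The two IDs differ even though the commit message is identical, because the content behind the commit changed.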

Suppose you ran git reset --hard HEAD^ and threw out your last commit. Well, it turns out you really did need those changes. When you do a reset, the commit you threw out goes into a dangling state. It’s still in Git’s datastore, waiting for a later git gc run to clean it up once nothing, including the reflog, refers to it any more. So unless a git gc has already pruned it, you can still find it.
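
A rough sketch of that rescue, again in a throwaway repository with made-up file names: it throws a commit away with a hard reset and then uses the reflog entry HEAD@{1} (where HEAD pointed just before the reset) to get it back. In a real repository you would read git reflog first and pick the entry you actually want.

```shell
#!/bin/sh
set -e

# Throwaway repository; names are placeholders for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "First commit"

echo "two" > file.txt
git commit -q -am "Second commit"
lost=$(git rev-parse HEAD)

# Throw the last commit away.
git reset -q --hard HEAD^

# The reflog still remembers where HEAD used to be.
git reflog
found=$(git rev-parse 'HEAD@{1}')

# Move the branch back to the rescued commit.
git reset -q --hard "$found"
restored=$(cat file.txt)
echo "file.txt after recovery: $restored"
```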

Not only that: from time to time, branches are created and deleted. The Git reflog still shows this history, and you can use it to recreate a deleted branch. So the branch’s commits are still there and have not been removed.
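
Here is one way that could look, as a sketch only. The branch name feature and the file names are invented for the demo, and it relies on the reflog of HEAD, which keeps recording the commits you had checked out even after the branch itself is deleted. HEAD@{1} happens to be the right entry in this tiny history; in practice you would look it up in the git reflog output.

```shell
#!/bin/sh
set -e

# Throwaway repository; names are placeholders for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "base" > file.txt
git add file.txt
git commit -q -m "Base commit"
main_branch=$(git symbolic-ref --short HEAD)

# Do some work on a branch, then delete the branch.
git checkout -q -b feature
echo "work" > feature.txt
git add feature.txt
git commit -q -m "Feature work"
tip=$(git rev-parse HEAD)
git checkout -q "$main_branch"
git branch -D feature

# HEAD@{0} is the checkout back to the main branch,
# HEAD@{1} is the last commit made on the deleted branch.
git reflog
recovered=$(git rev-parse 'HEAD@{1}')

# Recreate the branch at the recovered commit.
git branch feature "$recovered"
echo "feature restored at $(git rev-parse --short feature)"
```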

How to Run the Garbage Collector Manually

You can trigger this process yourself using the git gc command. Starting from every branch and every tag, Git walks back through the graph, building a list of every commit it can reach. Once it has reached the end of every path, it deletes the commits it didn’t visit (by default, only those older than a grace period of two weeks).
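
As a small illustration of a manual run, the sketch below creates a few commits in a throwaway repository and compares git count-objects -v before and after git gc. The repository, file name and commit messages are invented for the demo; the exact numbers you see will depend on your own repository.

```shell
#!/bin/sh
set -e

# Throwaway repository; names are placeholders for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

for n in 1 2 3; do
    echo "version $n" > file.txt
    git add file.txt
    git commit -q -m "Commit $n"
done

# "count" is the number of loose (unpacked) objects.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git gc -q

# After gc, reachable objects live in a pack file instead.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
in_pack=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose objects before gc: $loose_before"
echo "loose objects after gc:  $loose_after"
echo "objects in packs:        $in_pack"
```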

git gc tries very hard not to delete objects that are referenced anywhere in your repository. In particular, it will keep not only objects referenced by your current set of branches and tags, but also objects referenced by the index, remote-tracking branches, refs saved by git filter-branch in refs/original/, or reflogs (which may reference commits in branches that were later amended or rewound). If you are expecting some objects to be deleted and they aren’t, check all of those locations and decide whether it makes sense in your case to remove those references.
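
The following sketch shows that behaviour under a few assumptions: a throwaway repository, an amended commit whose only remaining reference is the reflog, and a deliberate decision to throw that reflog away. Expiring the reflog and pruning with --prune=now is destructive, so treat this as an illustration rather than something to run casually on a repository you care about.

```shell
#!/bin/sh
set -e

# Throwaway repository; names are placeholders for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "Original commit"
old=$(git rev-parse HEAD)

# Amending leaves the original commit referenced only by the reflog.
echo "two" > file.txt
git commit -q -a --amend -m "Amended commit"

# A plain gc keeps it, because the reflog still points at it.
git gc -q
if git cat-file -e "$old" 2>/dev/null; then
    after_plain_gc=present
else
    after_plain_gc=gone
fi
echo "after plain git gc: $after_plain_gc"

# Drop the reflog entries, then prune without the usual grace period.
git reflog expire --expire=now --expire-unreachable=now --all
git gc -q --prune=now
if git cat-file -e "$old" 2>/dev/null; then
    after_prune=present
else
    after_prune=gone
fi
echo "after reflog expire and gc --prune=now: $after_prune"
```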

When Should I Run the Garbage Collector?

As with other GC recommendations, there is no definitive answer. You need to balance how often you run it against the performance impact it can have while it runs.

Some Git commands may run git gc automatically. If you know what you’re doing and all you want is to disable this behavior permanently without further considerations, just run: git config --global gc.auto 0
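
If you would rather experiment before changing your global configuration, a sketch like the one below applies the same setting to a single throwaway repository using --local (my substitution for --global, so your own configuration is left alone) and then removes it again.

```shell
#!/bin/sh
set -e

# Throwaway repository so the global configuration is not modified.
repo=$(mktemp -d)
cd "$repo"
git init -q

# 0 disables automatic gc for this repository only.
git config --local gc.auto 0
auto_setting=$(git config --get gc.auto)
echo "gc.auto is now: $auto_setting"

# Remove the override to go back to the inherited behaviour.
git config --local --unset gc.auto
if git config --local --get gc.auto >/dev/null; then
    local_override=set
else
    local_override=unset
fi
echo "local override after unset: $local_override"
```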

You can also run the git gc --auto command from Git hooks.
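
One possible shape for such a hook, sketched under my own assumptions: I picked the post-merge hook as the trigger, and the script body is a hypothetical example rather than anything Git ships by default. The sketch installs the hook into a throwaway repository.

```shell
#!/bin/sh
set -e

# Throwaway repository; the choice of post-merge is an assumption.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ask Git where hooks live instead of hard-coding .git/hooks.
hooks_dir=$(git rev-parse --git-path hooks)
mkdir -p "$hooks_dir"
hook="$hooks_dir/post-merge"

# Hypothetical hook body: let Git decide whether housekeeping is due.
cat > "$hook" <<'EOF'
#!/bin/sh
exec git gc --auto --quiet
EOF

# Git only runs hooks that are executable.
chmod +x "$hook"
echo "installed hook at $hook"
```

Because the hook calls git gc --auto, it does nothing most of the time and only repacks when Git considers the repository untidy enough.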
