How does git handle moving files in the file system?

johnbakers

If I move files within a repository, such as from one folder to another, will git be smart enough to know that these are the same files and merely update its references to them, or does the new commit actually create copies of these files?

I ask because I wonder how useful git is for storing binary files. If it treats moved files as copies, a repo could easily grow very large even though you didn't actually add any new files.

torek

To understand how git handles this, you need to know two things to start with:

  • Each individual file (in any directory, in any commit) is stored individually, always.
  • But it's stored by its object-ID, which is unique for whatever data is in the file.
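You can see this content-addressing directly with git hash-object, which prints the ID git would use for any data (a quick sketch, assuming a POSIX shell; the file names are made up, and you don't even need a repository just to hash):

```shell
# Identical contents always hash to the identical object-ID.
printf 'contents\n' > one.txt
printf 'contents\n' > two.txt
git hash-object one.txt   # 12f00e90b6ef79117ce6e650416b8cf517099b78
git hash-object two.txt   # the same ID: it depends only on the bytes, not the name
```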

How things are stored

Let's say you have a new repo with one huge file in it:

$ mkdir temp; cd temp; git init
$ echo contents > bigfile; git add bigfile; git commit -m initial
[master (root-commit) d26649e] initial
 1 file changed, 1 insertion(+)
 create mode 100644 bigfile

The repo now has one commit, which has one tree (the top level directory), which has one file, which has some unique object-ID. (The "big" file is a lie, it's quite small, but it would work the same if it were many megabytes.)
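You can walk that commit-to-tree-to-file chain by hand with git cat-file (a sketch, run in the repo just created; the IDs you see will be your own):

```shell
git cat-file -t HEAD    # prints: commit
git cat-file -p HEAD    # the commit's first line names its tree object
git cat-file -p HEAD:   # the tree lists: 100644 blob <ID>  bigfile
```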

Now if you copy the file to a second version and commit that:

$ cp bigfile bigcopy; git add bigcopy; git commit -m 'make a copy'
[master 971847d] make a copy
 1 file changed, 1 insertion(+)
 create mode 100644 bigcopy

the repository now has two commits (obviously), with two trees (one for each version of the top level directory), and one file. The unique object-ID is the same for both copies. To see this, let's view the latest tree:

$ git cat-file -p HEAD:
100644 blob 12f00e90b6ef79117ce6e650416b8cf517099b78    bigcopy
100644 blob 12f00e90b6ef79117ce6e650416b8cf517099b78    bigfile

That big SHA-1 12f00e9... is the unique ID for the file contents. If the file really were enormous, git would now be using about half as much repo space as the working directory, because the repo has only one copy of the file (under the name 12f00e9...), while the working directory has two.

If you change the file contents, though—even one single bit, like making a lowercase letter uppercase or something—then the new contents will have a new SHA-1 object-ID, and need a new copy in the repo. We'll get to that in a bit.
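Again, git hash-object shows this directly (a sketch, hashing from stdin this time):

```shell
printf 'contents\n' | git hash-object --stdin   # 12f00e90b6ef79117ce6e650416b8cf517099b78
printf 'Contents\n' | git hash-object --stdin   # one changed letter: a whole new ID
```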

Dynamic rename detection

Now, suppose you have a more complicated directory structure (a repo with more "tree" objects). If you shuffle files around, but the contents of the "new" file(s)—under whatever name(s)—in new directories are the same as the contents that used to be in old ones, here's what happens internally:

$ mkdir A B; mv bigfile A; mv bigcopy B; git add -A .
$ git commit -m 'move stuff'
[master 82a64fe] move stuff
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename bigfile => A/bigfile (100%)
 rename bigcopy => B/bigcopy (100%)

Git has detected the (effective) rename. Let's look at one of the new trees:

$ git cat-file -p HEAD:A
100644 blob 12f00e90b6ef79117ce6e650416b8cf517099b78    bigfile

The file is still under the same old object-ID, so it's still only in the repo once. It's easy for git to detect the rename, because the object-ID matches, even though the path name (as stored in these "tree" objects) might not. Let's do one last thing:

$ mv B/bigcopy B/two; git add -A .; git commit -m 'rename again'
[master 78d92d0] rename again
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename B/{bigcopy => two} (100%)

Now let's ask for a diff between HEAD~2 (before any renamings) and HEAD (after renaming):

$ git diff HEAD~2 HEAD
diff --git a/bigfile b/A/bigfile
similarity index 100%
rename from bigfile
rename to A/bigfile
diff --git a/bigcopy b/B/two
similarity index 100%
rename from bigcopy
rename to B/two

Even though it was done in two steps, git can tell that to go from what was in HEAD~2 to what is now in HEAD, you can do it in one step by renaming bigcopy to B/two.

Git always does dynamic rename detection. Suppose that instead of doing renames, we'd removed the files entirely at some point, and committed that. Later, suppose we put the same data back (so that we got the same underlying object-IDs), and then diffed a sufficiently old version against the new one. Git would say that to go directly from the old version to the newest, you could just rename the files, even if that's not how we got there along the way.

In other words, the diff is always done commit-pair-wise: "At some time in the past, we had A. Now we have Z. How do I go directly from A to Z?" At that time, git checks for the possibility of renames, and produces them in the diff output as needed.
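Here's a sketch of that pair-wise behavior: remove a file, commit, put the same data back elsewhere, commit, then diff across the whole span (the user.name/user.email settings are only there so the commits succeed):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
echo data > oldname; git add oldname; git commit -qm A
git rm -q oldname; git commit -qm 'remove it'      # file gone entirely...
mkdir sub; echo data > sub/newname
git add sub/newname; git commit -qm Z              # ...then back, elsewhere
git diff -M --name-status HEAD~2 HEAD              # one step: a rename
# R100    oldname    sub/newname
```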

What about small changes?

Git will still (sometimes) show renames even if there's some small change to a file's contents. In this case, you get a "similarity index". Basically, you can tell git that given "some file deleted in rev A, some differently-named file added in rev Z" (when diffing revs A and Z), it should try diffing the two files to see if they're "close enough". If they are, you'll get a "file renamed and then changed" diff. The control for this is the -M or --find-renames argument to git diff: git diff -M80 says to show the change as rename-and-edit if the files are at least "80% similar".
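For example, a rename combined with dropping one line (a sketch; seq just writes numbered lines so the two files stay mostly similar):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
seq 1 100 > notes.txt; git add notes.txt; git commit -qm before
git rm -q notes.txt
seq 1 99 > renamed.txt                      # same contents, minus one line
git add renamed.txt; git commit -qm after
git diff -M80% --name-status HEAD~1 HEAD    # similar enough: shown as a rename
```

The status letter comes back as R plus the computed similarity; drop below the -M threshold and git shows a plain delete-plus-add instead.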

Git will also look for "copied then changed", with the -C or --find-copies flag. (You can add --find-copies-harder to do a more computationally-expensive search against all files; see the documentation.)
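A sketch of copy detection; note that plain -C only considers copy sources that were themselves modified in the same change, which is why --find-copies-harder is added here:

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
seq 1 50 > orig.txt; git add orig.txt; git commit -qm base
cp orig.txt dup.txt; git add dup.txt; git commit -qm 'copy it'
git diff -C --find-copies-harder --name-status HEAD~1 HEAD
# C100    orig.txt    dup.txt
```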

This relates (indirectly) to how git keeps repositories from blowing up in size over time, as well.

Delta compression

If you have a large file (or even a small file) and make a small change in it, git will store two complete copies of the file, using those object-IDs. You find these things in .git/objects; for instance, that file whose ID is 12f00e90b6ef79117ce6e650416b8cf517099b78 is in .git/objects/12/f00e90b6ef79117ce6e650416b8cf517099b78. They're compressed to save space, but even compressed, a big file can still be pretty big. So, if the underlying object is not very active and appears in a lot of commits with only a few small changes every now and then, git has a way to compress the modifications even further. It puts them into "pack" files.
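A sketch of that layout, plus packing (printf %.2s just peels off the first two hex characters of the ID):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
echo contents > bigfile; git add bigfile; git commit -qm initial
id=$(git rev-parse HEAD:bigfile)          # the blob's object-ID
ls .git/objects/"$(printf %.2s "$id")"    # the loose object: <2 chars>/<rest>
git gc --quiet                            # repack
ls .git/objects/pack                      # now a .pack file (plus its .idx)
```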

In a pack file, the object gets further compressed by comparing it to other objects in the repository.[1] For text files it's simple to explain how this works (although the actual delta-compression algorithm is different): if you had a long file and removed line 75, you could just say "use that other copy we have over there, but remove line 75." If you added a new line, you could say "use that other copy, but add this new line." You can express large files as sequences of instructions, using other large files as the basis.
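You can watch the deltas with git verify-pack (a sketch; whether the second blob actually becomes a delta is up to git's heuristics, but for two near-identical text files it normally does):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
seq 1 1000 > long.txt; git add long.txt; git commit -qm v1
grep -v '^75$' long.txt > tmp && mv tmp long.txt    # "remove line 75"
git add long.txt; git commit -qm v2
git gc --quiet
git verify-pack -v .git/objects/pack/*.idx | grep blob
# two blob lines; typically one stored whole, the other as a small delta
```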

Git does this sort of compression for all objects (not just files), so it can compress a commit against another commit, or trees against each other, too. It's really quite efficient, but with one problem.

The binary file problem

Some (not all) binary files delta-compress very badly against each other. In particular, with a file that is compressed with something like bzip2, gzip, or zip, making one small change anywhere tends to change the rest of the file as well. Images (JPEGs, etc.) are often compressed and suffer from this sort of effect. (I don't know of many uncompressed image formats. PBM files are completely uncompressed, but that's the only one I know of off-hand that is still in use.)

If you make no changes at all to binary files, git compresses them super-efficiently because of the unchanging low-level object-IDs. If you make small changes, git's compression algorithms can (not necessarily "will") fail on them, so that you get multiple copies of the binaries. I know that large gzip'ed cpio and tar archives do very badly: one small change to such a file and a 2 GB repo becomes a 4 GB repo.

Whether your particular binaries compress well or not is something you'd have to experiment with. If you're just renaming the files, you should have no problem at all. If you're changing large JPG images often, I would expect this to perform poorly (but it's worth experimenting).
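A sketch of such an experiment (the absolute numbers vary; the thing to compare is size-pack here against the size of a single copy of the compressed file):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
seq 1 100000 | gzip -c > big.gz
git add big.gz; git commit -qm v1
{ echo extra; seq 1 100000; } | gzip -c > big.gz   # tiny change to the *input*
git add big.gz; git commit -qm v2
git gc --quiet
git count-objects -v | grep '^size-pack'   # compare against: ls -l big.gz
```

If the two compressed blobs delta'd well, size-pack would stay close to the size of one big.gz; when the delta fails, you pay for both copies.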


[1] In "normal" pack files, an object can only be delta-compressed against other objects in the same pack file. This keeps the pack files stand-alone, as it were. A "thin" pack can use objects not in the pack file itself; these are meant for incremental updates over networks, for instance with git fetch.
