Git object basics I

Let us think about Git in a simplified manner, namely looking at it as if we're just maintaining a filesystem state.

Let us consider a simple directory that we want to track using git: a readme.md file at the root, plus a docs directory containing bio.txt and me.jpeg.

In terms of internal git objects, there are three types that we can use to represent this structure:

  • BLOB: the contents of a file, stored as raw bytes
  • TREE: a listing of blobs and other trees, i.e. a directory
  • COMMIT: a snapshot of the working tree

In the following sections we'll briefly explore what these entities are.

trees 🌲

A Git tree object represents a directory. It essentially functions as a description of the file system hierarchy at a particular moment in time. A tree object can reference other trees (subdirectories) and blobs (files), mimicking the structure of a file system directory. Tree objects are identified by their SHA-1 hash, which is calculated from the contents of the tree object, that is, for every entry:

  • the name and mode (file type and permissions) of the entry
  • the SHA-1 identifier of the blob or tree that the entry points to
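
To make the hashing rule above concrete, here is a minimal Python sketch of how a tree identifier can be derived. It assumes the usual loose-object layout (a `tree <size>` header, a NUL byte, then one `<mode> <name>` + NUL + 20 raw hash bytes record per entry); the helper names `git_object_id` and `tree_body` are made up for this illustration and are not part of git.

```python
import hashlib

def git_object_id(kind: str, body: bytes) -> str:
    # Every object is hashed as "<kind> <size>", a NUL byte, then the body.
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries) -> bytes:
    # entries: (mode, name, hex_id) tuples; directories use mode "40000"
    # (git ls-tree prints it zero-padded as 040000).
    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name, with directories compared as "name/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}\0".encode() + bytes.fromhex(hex_id)
    return body

# Entries of the docs directory of our sample repo.
docs_entries = [
    ("100644", "bio.txt", "837c50063e6a1a55d8862632cb122da0d42b8239"),
    ("100644", "me.jpeg", "7a0af0697e98691700c3bbe050f6910ca6c71346"),
]
print(git_object_id("tree", tree_body(docs_entries)))
```

The printed value can be compared with the identifier that git reports for the docs tree. Changing any name, mode or child hash in `docs_entries` produces a different identifier, which is why a change deep in the hierarchy bubbles up to the root tree.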

For our repo we can check the contents of the tree as follows:

jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md

To reconstruct the full structure of our repository, we can also check the tree of our docs directory by appending the path to the same reference:

jose@syn:~/src/gitint$ git ls-tree HEAD:docs
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	me.jpeg

The same listing can be obtained by providing the SHA-1 hash that identifies the tree:

jose@syn:~/src/gitint$ git ls-tree bb6e0e61969fafcc94435ace3fa644237667bacb
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	me.jpeg

With the above object representation we can then reconstruct the abstracted directory listing of both trees and blobs from our repo.
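
As a rough illustration of that reconstruction, the following Python sketch walks the two listings above recursively and prints the full path of every blob. The `ROOT` and `TREES` structures are hand-copied from the `git ls-tree` outputs, and `walk` is a hypothetical helper, not a git API.

```python
# Root listing, as printed by `git ls-tree HEAD`: (type, id, name).
ROOT = [
    ("tree", "bb6e0e61969fafcc94435ace3fa644237667bacb", "docs"),
    ("blob", "fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c", "readme.md"),
]

# Sub-tree listings, keyed by the tree's SHA-1.
TREES = {
    "bb6e0e61969fafcc94435ace3fa644237667bacb": [
        ("blob", "837c50063e6a1a55d8862632cb122da0d42b8239", "bio.txt"),
        ("blob", "7a0af0697e98691700c3bbe050f6910ca6c71346", "me.jpeg"),
    ],
}

def walk(entries, prefix=""):
    """Yield (path, blob_id) for every file reachable from a tree listing."""
    for kind, object_id, name in entries:
        path = prefix + name
        if kind == "tree":
            # A tree entry is a subdirectory: descend into its own listing.
            yield from walk(TREES[object_id], path + "/")
        else:
            yield path, object_id

for path, blob_id in walk(ROOT):
    print(blob_id, path)
```

This is essentially what the recursive `-r` option of `git ls-tree` does for us.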

blobs 🩻

A git blob (binary large object) is a fundamental data structure in git. It stores the contents of a file in the repository at a given time, and once created it is immutable. Unlike a file in a filesystem, a blob does not carry any metadata of its own: the file name and mode live in the tree that references it. The blob itself is identified only by the SHA-1 hash of its contents.

As we saw above, we can obtain the SHA-1 hashes of the blobs by recursing through the tree objects that reference them. Let us list all the blobs in our repo, together with their sizes in bytes:

jose@syn:~/src/gitint$ git ls-tree -r -l HEAD
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239      16	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 1713408	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c      26	readme.md

To check the actual bytes of a file as git stores them, we can print the blob and pipe it through hexdump:

jose@syn:~/src/gitint$ git cat-file -p fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c | hexdump -C
00000000  57 65 6c 63 6f 6d 65 20  74 6f 20 6d 79 20 73 61  |Welcome to my sa|
00000010  6d 70 6c 65 20 72 65 70  6f 0a                    |mple repo.|
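
Those bytes are all that goes into the blob's identifier. A small Python sketch, assuming the standard `blob <size>` header followed by a NUL byte and the raw contents (`blob_id` is a name chosen here for illustration):

```python
import hashlib

def blob_id(content: bytes) -> str:
    # Git hashes "blob <size>", a NUL byte, then the raw file bytes.
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# The 26 bytes shown in the hexdump above.
readme = b"Welcome to my sample repo\n"
print(len(readme), blob_id(readme))
```

The printed hash can be compared with the one listed for readme.md. Note that neither the file name nor its permissions take part in the computation, so two files with identical contents share a single blob.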

commit 📸

A git commit is an object that represents a snapshot of the full working tree at a given point in time, along with its content. It contains two main sections:

  • a pointer to the main tree, or our "root directory"
  • a metadata section that includes things like the commit author's name, the commit time, the commit message and, except for the very first commit, one or more parent commit objects (the previous snapshots that this one was based on).

As expected by now, the commit object is also identified by its SHA-1 hash. This is usually what we see when we check the git log. One must note that a git commit refers to the entire snapshot of the objects at that point in time, not only to the changes introduced by the committer.

Above we mentioned that blobs are immutable. This means that when a file changes and we make a new commit, a new blob with a different SHA-1 hash is created for that file, and the trees that contain it get new hashes as well. Files that stay the same keep the same hash, so the new trees simply keep referencing the existing blobs. This is how, within a commit, git optimizes storage: it only writes what has in fact changed and reuses the blobs from parent commits that were not modified (whose SHA-1 hash is the same).

Let us recall the contents of the blob for the docs/bio.txt file:

jose@syn:~/src/gitint$ git cat-file -p 837c50063e6a1a55d8862632cb122da0d42b8239 | hexdump -C
00000000  48 45 4c 4c 4f 20 49 20  61 6d 20 6a 6f 73 65 0a  |HELLO I am jose.|

Our root tree:

jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md

And the blobs within it (recursively from the root):

jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md

We'll now modify the file and commit the changes:

jose@syn:~/src/gitint$ echo "My spoon is too big" >> docs/bio.txt 
jose@syn:~/src/gitint$ git commit -am "new bio version"
[main d1f46f8] new bio version
 1 file changed, 1 insertion(+)

Listing our root tree now shows a different hash for the docs tree:

jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree af1ef4ba103d5da65db58bedae6ffa151faa6017	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md

The updated af1ef4ba103d5da65db58bedae6ffa151faa6017 is a different value from the one we had before the commit (bb6e0e61969fafcc94435ace3fa644237667bacb). This is of course because the blob for that specific file also changed, and the docs tree records that blob's hash:

jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 7c06157b32c6f934709ae6a95ec46118cf5c0f7d	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md

If we compare with the blob hashes before the commit, we can verify that only the docs/bio.txt file has changed. This is how git optimizes storage: each snapshot (commit) only adds the objects that changed, and keeps the same pointers for everything that was not touched.
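
The comparison can also be done mechanically. Here is a short Python sketch that parses the two recursive listings from this section and reports which paths point to a new blob; `parse` is a hypothetical helper and the listings are pasted in by hand.

```python
BEFORE = """
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239 docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
"""

AFTER = """
100644 blob 7c06157b32c6f934709ae6a95ec46118cf5c0f7d docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
"""

def parse(listing: str) -> dict:
    """Map each path to its blob id from `git ls-tree -r` style lines."""
    blobs = {}
    for line in listing.strip().splitlines():
        # Fields: mode, type, id, path (the path may itself contain spaces).
        _mode, _kind, object_id, path = line.split(None, 3)
        blobs[path] = object_id
    return blobs

before, after = parse(BEFORE), parse(AFTER)
changed = sorted(p for p in after if before.get(p) != after[p])
reused = sorted(p for p in after if before.get(p) == after[p])
print("changed:", changed)  # changed: ['docs/bio.txt']
print("reused:", reused)    # reused: ['docs/me.jpeg', 'readme.md']
```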

Besides the pointer to the root tree, a commit's metadata also records the parent commit it was based on. This is present on every commit except the very first one. Let us consider our current git log:

jose@syn:~/src/gitint$ git log
commit d1f46f80cea9e965ccc5c1d372461dd124ec1712 (HEAD -> main)
Author: Jose Alves <jose.tapadas@gmail.com>
Date:   Thu Mar 7 22:05:58 2024 +0000

    new bio version

commit 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
Author: Jose Alves <jose.tapadas@gmail.com>
Date:   Thu Mar 7 18:12:37 2024 +0000

    initial commit for the git internals repo

We can see that the initial commit has no parent (the output is empty):

jose@syn:~/src/gitint$ git show --format="%P" 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55 -s

But our subsequent commit, currently at HEAD, shows what its parent is:

jose@syn:~/src/gitint$ git show --format="%P" d1f46f80cea9e965ccc5c1d372461dd124ec1712 -s
8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
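
These parent pointers are what turn individual commits into a history. A minimal Python sketch of following them, with the `PARENTS` table copied by hand from the two outputs above and `history` being a hypothetical helper:

```python
# Parent links as reported by `git show --format="%P"`.
PARENTS = {
    "d1f46f80cea9e965ccc5c1d372461dd124ec1712": [
        "8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55",
    ],
    "8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55": [],  # initial commit
}

def history(commit_id):
    """Follow first-parent links until a commit without parents is reached."""
    chain = []
    while commit_id is not None:
        chain.append(commit_id)
        parents = PARENTS[commit_id]
        commit_id = parents[0] if parents else None
    return chain

for commit_id in history("d1f46f80cea9e965ccc5c1d372461dd124ec1712"):
    print(commit_id[:7])
```

This prints the abbreviated hashes in the same order as git log, from HEAD back to the initial commit. A merge commit would list more than one parent; the sketch only follows the first.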

Two further details are worth remembering:

  • the exact same content for a specific file, even when written by different authors, will always produce the same blob and the same SHA-1 hash value
  • the commit will always be different, since not only the tree and blob info but also the metadata associated with the commit (author data, time, ...) contributes to generating its hash
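
The second point can be illustrated with a small Python sketch that hashes two commits pointing to the same tree. It assumes the usual commit layout (`tree`, `parent`, `author` and `committer` lines, a blank line, then the message); the identities and timestamps are invented, and the tree id is just a placeholder (the well-known id of an empty tree).

```python
import hashlib

def commit_id(tree, parents, author, committer, message):
    # Commit body: header lines, a blank line, then the commit message.
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    # Same hashing scheme as blobs and trees: "<kind> <size>", NUL, body.
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # placeholder tree id
PARENT = ["8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55"]

# Hypothetical identities: "<name> <email> <unix time> <timezone>".
alice = "Alice <alice@example.com> 1709849158 +0000"
bob = "Bob <bob@example.com> 1709849160 +0000"

a = commit_id(TREE, PARENT, alice, alice, "new bio version")
b = commit_id(TREE, PARENT, bob, bob, "new bio version")
print(a)
print(b)
print(a != b)  # True: same snapshot, different metadata
```

Even with identical trees, parents and messages, a different author or a timestamp one second apart is enough to yield a different commit hash.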

wrap up

This was a simple tour of the three basic objects behind git's representation of a filesystem evolving through time: commits form a linked history of "snapshots", each one pointing to a root tree, and trees reference other trees or the blobs of the files they include. All of them are identified by SHA-1 hashes, which change whenever their content does.