Git object basics I
Let us think about Git in a simplified manner, looking at it as if we were just maintaining a filesystem state. Consider a simple directory that we want to track using git: a readme.md file at the root and a docs directory containing bio.txt and me.jpeg.
In terms of internal git objects, we'll look at the 3 types that are used to represent this structure:
- BLOB: a binary representation of the contents of the file
- TREE: a representation of a binary listing of blobs and other trees
- COMMIT: a snapshot of the working tree
In the following sections we'll briefly explore what these entities are.
trees 🌲
A Git tree object represents a directory. It essentially functions as a description of the file system hierarchy at a particular moment in time. A tree object can reference other trees (subdirectories) and blobs (files), mimicking the structure of a file system directory. Tree objects are identified by their SHA-1 hash value, which is calculated based on:
- the contents of the tree object, that is, one entry per item it contains
- for each entry, the name, the mode (permissions) and the SHA-1 identifier of the referenced blob or tree.
For our repo we can check the contents of the tree as follows:
jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
And to reconstruct the full structure of our repository, we can also check the tree of our docs directory by using the following syntax on the same reference:
jose@syn:~/src/gitint$ git ls-tree HEAD:docs
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239 bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 me.jpeg
But the same can be obtained by providing the tree's identifying SHA-1 hash:
jose@syn:~/src/gitint$ git ls-tree bb6e0e61969fafcc94435ace3fa644237667bacb
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239 bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 me.jpeg
With the above object representation we can then reconstruct the directory listing of both trees and blobs from our repo.
blobs 🩻
A git blob (binary large object) is a fundamental data structure in git. It stores the contents of a file in a repository at a given time, and once created it is immutable. Unlike a file in a filesystem, a blob does not carry any metadata such as the file name or permissions (those are recorded in the tree that references it), and it is identified by its own SHA-1 hash.
As we saw above, we can obtain the blob SHA-1 hashes of the files by recursing through their definitions in the tree objects. Let us obtain the list of blobs in our repo with:
jose@syn:~/src/gitint$ git ls-tree -r -l HEAD
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239 16 docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 1713408 docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c 26 readme.md
So, to check the binary representation of a file as git abstracts it, we can run:
jose@syn:~/src/gitint$ git cat-file -p fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c | hexdump -C
00000000 57 65 6c 63 6f 6d 65 20 74 6f 20 6d 79 20 73 61 |Welcome to my sa|
00000010 6d 70 6c 65 20 72 65 70 6f 0a |mple repo.|
commit 📸
A git commit is an object that represents a snapshot of the full working tree at a given point in time, along with its content. It contains two main sections:
- a pointer to the main tree, or our "root directory"
- a metadata section that includes things like the commit author's name, the commit time, the commit message and usually one or more parent commit objects (earlier snapshots of our structure).
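The two sections above can be sketched in Python as follows. This is a simplified view that assumes the default SHA-1 object format and leaves out optional headers such as signatures. The function name commit_hash, the identity strings and the use of the empty tree id as a stand-in root tree are all illustrative assumptions.

```python
import hashlib

# Placeholder values for illustration only (not taken from our repo).
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d584199d3fd8e3"
ALICE = "Alice Example <alice@example.com> 1700000000 +0000"
BOB = "Bob Example <bob@example.com> 1700000000 +0000"


def commit_hash(tree, parents, author, committer, message):
    # Pointer to the root tree first, then the metadata section.
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    if not body.endswith(b"\n"):
        body += b"\n"
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


first = commit_hash(EMPTY_TREE, [], ALICE, ALICE, "initial commit")
second = commit_hash(EMPTY_TREE, [first], ALICE, ALICE, "second commit")
same_tree_other_author = commit_hash(EMPTY_TREE, [], BOB, BOB, "initial commit")
print(first != same_tree_other_author)  # True
```

Because the author, the timestamps and the parent ids are part of the hashed text, two commits that point to exactly the same tree still end up with different ids.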
As expected by now, the commit object is also identified by its SHA-1 hash. This is usually what we see when we check the git log. Note that a git commit refers to the entire snapshot of the objects at that point in time, not only the diff introduced by the committer.
Above we mentioned that blobs are immutable. This means that when a file is changed and a new commit is made, only the modified file gets a new blob with a different SHA-1 hash. For the files that stay the same, the hash does not change, so the tree entries pointing to them stay the same too. This is how, within a commit, git optimizes storage: it only writes what has in fact changed and reuses the blobs from parent commits that were not modified (whose SHA-1 hash is the same).
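A toy model of this behaviour, written as a hedged Python sketch: an in-memory dictionary stands in for git's object database, and the class ObjectStore is our own invention rather than anything git provides. The file contents are the ones shown in the hexdumps, plus the line we append to bio.txt below.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: the key is derived from the content."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        data = b"blob " + str(len(content)).encode() + b"\0" + content
        oid = hashlib.sha1(data).hexdigest()
        # An object that already exists is never rewritten.
        self.objects.setdefault(oid, content)
        return oid


store = ObjectStore()
readme = b"Welcome to my sample repo\n"
bio_v1 = b"HELLO I am jose\n"
bio_v2 = bio_v1 + b"My spoon is too big\n"

snapshot_1 = {"readme.md": store.put_blob(readme),
              "docs/bio.txt": store.put_blob(bio_v1)}
snapshot_2 = {"readme.md": store.put_blob(readme),
              "docs/bio.txt": store.put_blob(bio_v2)}

print(len(store.objects))  # 3: one readme blob shared, two bio versions
```

Both snapshots point to the very same readme object, so the second snapshot only costs one additional blob.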
Let us recall the blob info of the docs/bio.txt file:
jose@syn:~/src/gitint$ git cat-file -p 837c50063e6a1a55d8862632cb122da0d42b8239 | hexdump -C
00000000 48 45 4c 4c 4f 20 49 20 61 6d 20 6a 6f 73 65 0a |HELLO I am jose.|
Our root tree:
jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
And our blobs within (recursively from the root):
jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239 docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
We'll now modify the file and commit the changes:
jose@syn:~/src/gitint$ echo "My spoon is too big" >> docs/bio.txt
jose@syn:~/src/gitint$ git commit -am "new bio version"
[main d1f46f8] new bio version
1 file changed, 1 insertion(+)
Now, listing our root tree, we get a different hash for docs:
jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree af1ef4ba103d5da65db58bedae6ffa151faa6017 docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
So the updated af1ef4ba103d5da65db58bedae6ffa151faa6017 is a different computed value than the one we had before the commit (bb6e0e61969fafcc94435ace3fa644237667bacb). This is of course because the blob for that specific file also changed:
jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 7c06157b32c6f934709ae6a95ec46118cf5c0f7d docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
But if we compare with the blob hashes before the commit, we can verify that only the docs/bio.txt file has changed. This is how git optimizes storage: each snapshot (commit) only adds the objects that changed and keeps the same pointer for whatever was not touched.
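The comparison we just did by eye can also be sketched in a few lines of Python. The two listings are copied from the git ls-tree output before and after the commit, and the helper parse_ls_tree is a name we made up for this example.

```python
BEFORE = """\
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239 docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
"""

AFTER = """\
100644 blob 7c06157b32c6f934709ae6a95ec46118cf5c0f7d docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c readme.md
"""


def parse_ls_tree(text):
    """Map each path to its object id, given recursive ls-tree lines."""
    ids = {}
    for line in text.strip().splitlines():
        # Fields: mode, type, id, path (the path may itself contain spaces).
        _mode, _kind, oid, path = line.split(None, 3)
        ids[path] = oid
    return ids


before, after = parse_ls_tree(BEFORE), parse_ls_tree(AFTER)
changed = sorted(p for p in after if before.get(p) != after[p])
print(changed)  # ['docs/bio.txt']
```

Every path whose id is unchanged is guaranteed to have the same content in both commits, so nothing new had to be stored for it.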
Besides the pointer to the tree, part of the metadata of a commit is the parent commit that it was based on. This is present on every commit except the very first one. Let us consider our current git log:
jose@syn:~/src/gitint$ git log
commit d1f46f80cea9e965ccc5c1d372461dd124ec1712 (HEAD -> main)
Author: Jose Alves <jose.tapadas@gmail.com>
Date: Thu Mar 7 22:05:58 2024 +0000
new bio version
commit 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
Author: Jose Alves <jose.tapadas@gmail.com>
Date: Thu Mar 7 18:12:37 2024 +0000
initial commit for the git internals repo
We can see that the initial commit has no parent (the output is empty):
jose@syn:~/src/gitint$ git show --format="%P" 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55 -s
But our subsequent commit, currently at HEAD, shows what its parent(s) were:
jose@syn:~/src/gitint$ git show --format="%P" d1f46f80cea9e965ccc5c1d372461dd124ec1712 -s
8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
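Following these parent pointers from HEAD back to the first commit is essentially what walking the history means. Here is a minimal Python sketch of that walk, using the two commit ids from our log. The dictionary and the function history are illustrative only, and real git log additionally orders commits by date.

```python
# Commit id -> list of parent ids, as reported by the %P format above.
PARENTS = {
    "d1f46f80cea9e965ccc5c1d372461dd124ec1712":
        ["8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55"],
    "8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55": [],
}
HEAD = "d1f46f80cea9e965ccc5c1d372461dd124ec1712"


def history(start, parents_of):
    """Return every commit reachable from start, newest first."""
    seen, order, stack = set(), [], [start]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue  # a merge can reach the same ancestor twice
        seen.add(oid)
        order.append(oid)
        # Push in reverse so the first parent is visited first.
        stack.extend(reversed(parents_of[oid]))
    return order


for oid in history(HEAD, PARENTS):
    print(oid[:7])
```

The walk stops at the initial commit simply because its parent list is empty.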
Two more details are worth remembering:
- the exact same file content, even when committed by different authors, will always get the same blob and SHA-1 hash value
- the commit hash will always be different, since not only the tree and blob info but also the metadata associated with the commit (author data, time, ...) contributes to the hash
wrap up
This was a simple look at the three basic objects in git's representation of a filesystem evolving through time: commits take a "snapshot" and form a linked chain, each pointing to a tree that references other trees or the blobs for the included files, all identified by SHA-1 hashes that trace every change made to them.