Creating a git repo from scratch: plumber style

Have you ever wanted to create a new git repo from scratch using only git's low-level plumbing commands? Of course not, and that is why this small article will showcase how Mario would do it.


In a previous article we delved into the basics of git internals and concluded, succinctly, that we have the following main entities:

blob: a binary representation of the contents of a file
tree: a representation of a directory listing consisting of blobs and other trees
commit: a snapshot of the working tree

In this new entry we will look at how to create a repository, a commit and a branch from scratch, without using any of the fancy, user-friendly commands like git init, git add, git commit, etc., just because we can and because it is Sunday and the kid is asleep.

As a small remark, the failed pun regarding plumbers is actually a git concept: git distinguishes between the porcelain commands (the user-friendly commands that interface with git internals) and the plumbing commands (the internal low-level commands). This is of course an even sadder pun on the actual works around a toilet: we usually look at the ceramic porcelain, but when something needs a deep fix (omg) we have to go into the plumbing. For more context one may refer to: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

So what is a repo anyway?

We can look at a usual project managed by git as containing the following 3 entities:

a working directory on our filesystem with its inherent directory and file structure
our repository, which is basically a set of commit objects that hold the representation of our working directory at a certain point in time
our index (commonly referred to as the "staging area"), which lists the files, and the blob each of them points to, that will make up the next commit. A bit like our "database" of staged files.

A git repository is then a collection of objects and a system to name and reference those objects, usually referred to as refs.

We intend to create a repo from scratch using only plumbing commands, without the benefits of our common user-friendly commands, in order to grasp a bit more of the internals of git.

Let'sa go build the structure

Let us then go down the pipe and create a new directory to hold our new repository and ask git what it thinks about it:

$ mkdir gitrepo
$ cd gitrepo/
$ git status
fatal: not a git repository (or any of the parent directories): .git

So we have created our folder, but git does not see it as a repository (as expected). This is where we would normally use git init, but that is out of reach for us today. As we mentioned before, a git repo is basically a set of commit objects and a set of references to them. You may have already noticed that a git repo contains a very specific folder named .git, so let us now create the structure we have been talking about:

$ mkdir .git
$ mkdir .git/objects
$ mkdir -p .git/refs/heads
$ tree .git/
.git/
├── objects
└── refs
    └── heads

4 directories, 0 files

So above we now have the usual structure: a folder to hold our git objects and a folder to hold our references. The heads folder will hold a special kind of reference that we usually refer to as branches. Based on this, can you already infer the nature of a branch in this mix? A branch is nothing more than a named reference to a specific commit.

Asking git about the state now that we created the base structure:

$ git status
fatal: not a git repository (or any of the parent directories): .git

Nope, it is still not happy. This is because git itself does not know what to look for. Besides the objects and refs folders, git expects a base pointer, a base reference, named HEAD, which it uses to find the current commit (which we don't have yet). And what is HEAD? Just a simple file that points to a specific reference. Let us now create it:

$ echo "ref: refs/heads/master" > .git/HEAD
$ tree .git/
.git/
├── HEAD
├── objects
└── refs
    └── heads

So now we have created HEAD and stated which reference it points to, and we can verify what git thinks about it:

$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)
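
As a quick sanity check of the claim that HEAD was the missing piece, here is a minimal sketch that builds the same skeleton in a throwaway directory and asks git, before and after writing HEAD, whether it sees a repository. The temporary directory and the is_repo helper are our own illustrative names, and GIT_CEILING_DIRECTORIES is only set so that git does not wander up into some unrelated parent repository.

```shell
# Sketch: which pieces does git need before it accepts a hand-made .git?
# The temporary directory and the helper name is_repo are illustrative only.
repo=$(mktemp -d)
cd "$repo" || exit 1

# Stop git from searching parent directories for some other repository.
GIT_CEILING_DIRECTORIES=$(dirname "$repo")
export GIT_CEILING_DIRECTORIES

is_repo() {
  # rev-parse --git-dir exits non-zero when no repository is found
  if git rev-parse --git-dir >/dev/null 2>&1; then echo yes; else echo no; fi
}

mkdir -p .git/objects .git/refs/heads
before=$(is_repo)          # objects/ and refs/ alone

echo "ref: refs/heads/master" > .git/HEAD
after=$(is_repo)           # same skeleton plus HEAD

# symbolic-ref reads the ref that HEAD points to, even with no commits yet
branch=$(git symbolic-ref --short HEAD)

echo "before HEAD: $before, after HEAD: $after, branch: $branch"
```

Nothing here is more magical than what we typed by hand; it only packs the before and after into one run.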

Yup, now git, by looking into HEAD, knows where we are. Still, if we try to check our git log:

$ git log
fatal: your current branch 'master' does not have any commits yet

As expected, it just reports a fatal error, as we lack any kind of commit. As we mentioned before, refs (branches included) are just named references to a kind of git object named commit. A commit points to a tree, and a tree in turn lists blobs and other trees. So let us start from the bottom and write a new binary blob to our object database (.git/objects):

$ echo "Mario" | git hash-object --stdin -w
5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd

We use git-hash-object (just man it for more), taking the content from stdin, and with -w we write the resulting object to our object store:

$ tree .git
.git
├── HEAD
├── objects
│   └── 5e
│       └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
└── refs
    └── heads

As we can see, the binary blob is now added to our .git/objects folder. The subfolder 5e is named after the first byte (the first two hex digits) of the object's SHA-1 hash value, and the file name is the rest of the hash. This is just an optimization that git uses to fan objects out by their first byte, so that a lookup does not have to scan one huge directory. We can verify the object by using the git-cat-file command with the returned hash:

$ git cat-file -t 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
blob
$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
Mario

So we can verify that this specific object is in fact a blob, and we can even see its contents. Let us check the status:

$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Yup, nothing there. No changes nor commits. This is because the blob only exists in the object store: nothing in our index, the list of tracked changes we mentioned above, refers to it yet. Let us then use the command git-update-index to add an entry for this new blob to the index:

$ git update-index --add --cacheinfo 100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd itsame.txt

$ tree .git
.git
├── HEAD
├── index
├── objects
│   └── 5e
│       └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
└── refs
    └── heads

5 directories, 3 files

We can see that the command has now created a new file named .git/index (our staging area). In case you are wondering about that magic number 100644, it simply marks this blob as a normal file. If you're interested, the docs about objects state that:

In this case, you’re specifying a mode of 100644, which means it’s a normal file. Other options are 100755, which means it’s an executable file; and 120000, which specifies a symbolic link.
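
If you want to peek at what the index holds at this point, git-ls-files with --stage prints one line per entry. A small sketch, run in a throwaway directory (the path is illustrative) so it does not touch the repo we are building:

```shell
# Sketch: stage a blob by hand, then list the index entries.
# The temporary directory is illustrative, not part of the walkthrough.
repo=$(mktemp -d)
cd "$repo" || exit 1
mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD

blob=$(echo "Mario" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$blob" itsame.txt

# Each line reads: <mode> <blob id> <stage number> TAB <path>
staged=$(git ls-files --stage)
echo "$staged"
```

The mode, the blob ID and the path are exactly the three things we handed to git-update-index; the stage number only becomes interesting during merge conflicts.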

Let us check our status:

$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   itsame.txt

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    itsame.txt

Now this is quite an interesting output:

we see our itsame.txt file added to our index (which before this article we used to call the "staging area")
we have the same file being marked as deleted

This is because the index lists itsame.txt, but no such file exists in the working directory. Due to this difference, git assumes the file was deleted at this point in time. Let us then create the file, with the same content as the blob, using the git-cat-file command:

$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd > itsame.txt
$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   itsame.txt
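
Redirecting git-cat-file works, but there is also a plumbing command meant for exactly this job: git-checkout-index writes the files listed in the index into the working directory. A minimal sketch in a throwaway directory, which is our own setup and not part of the walkthrough:

```shell
# Sketch: materialize staged files with checkout-index instead of cat-file.
# The temporary directory is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD

blob=$(echo "Mario" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$blob" itsame.txt

# -a checks out every path listed in the index into the working directory
git checkout-index -a

content=$(cat itsame.txt)
echo "itsame.txt now contains: $content"
```

The nice part is that we do not need to repeat the hash or the file name: the index already knows both.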

And now we have the proper change being tracked as something to be committed.

Building the commit

At this point we can create our snapshot of the contents of our project by creating a commit object. Of course, we won't be using the porcelain git-commit command.

As we mentioned above, a commit is an object that points to a tree, representing our project at a specific moment in time. Let us then start by writing a tree object with our current index state using the command git-write-tree:

$ git write-tree
88e009ef712773b75ffdfba27ea1a87858f16a4a
$ tree .git/
.git/
├── HEAD
├── index
├── objects
│   ├── 5e
│   │   └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
│   └── 88
│       └── e009ef712773b75ffdfba27ea1a87858f16a4a
└── refs
    └── heads

6 directories, 4 files

We can then see that the object is created and added to our .git/objects structure. We can verify what it is as well:

$ git cat-file -t 88e009ef712773b75ffdfba27ea1a87858f16a4a
tree
$ git cat-file -p 88e009ef712773b75ffdfba27ea1a87858f16a4a
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd	itsame.txt

From here we see that it is in fact... a tree, and that its entry contains the mode (100644, a regular file), the type (blob), the hash and the name of the file that is present in our project. Again, please be reminded that a tree is nothing more than an object representing a directory by referencing both blobs for files and other trees for subdirectories (please refer to the previous article about git object basics for a refresher if this is not clear by now).
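
The index is not the only way to get a tree. As a side sketch, git-mktree builds a tree object straight from text in the same format that git cat-file -p printed above. Assuming the blob is already in the object store, both routes should give the same tree ID; the throwaway directory is illustrative:

```shell
# Sketch: build the same tree twice, once from the index and once from text.
# The temporary directory is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD

blob=$(echo "Mario" | git hash-object --stdin -w)

# Route 1: stage the blob, then snapshot the index
git update-index --add --cacheinfo 100644 "$blob" itsame.txt
from_index=$(git write-tree)

# Route 2: feed mktree a line in ls-tree format: <mode> <type> <id> TAB <name>
from_text=$(printf '100644 blob %s\titsame.txt\n' "$blob" | git mktree)

echo "write-tree: $from_index"
echo "mktree:     $from_text"
```

Same content, same ID: a tree is identified by what it lists, not by how it was produced.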

Now that we have a tree, we can proceed to create a commit object that points to it, using the command git-commit-tree:

$ git commit-tree 88e009ef712773b75ffdfba27ea1a87858f16a4a -m "initial commit"
f81a29be9727635c552d3c16fc541fed104b3e24

$ tree .git/
.git/
├── config
├── HEAD
├── index
├── objects
│   ├── 5e
│   │   └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
│   ├── 88
│   │   └── e009ef712773b75ffdfba27ea1a87858f16a4a
│   └── f8
│       └── 1a29be9727635c552d3c16fc541fed104b3e24
└── refs
    └── heads

7 directories, 6 files

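We can see that the commit object is created and added to our .git/objects directory as well, like any other git object. (The config file that also shows up in this listing is not produced by git-commit-tree; it is where repository-level settings live, presumably the author identity that git needs in order to commit.) Checking the content of this object, for the more curious or inquisitive: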

$ git cat-file -t f81a29be9727635c552d3c16fc541fed104b3e24
commit
$ git cat-file -p f81a29be9727635c552d3c16fc541fed104b3e24
tree 88e009ef712773b75ffdfba27ea1a87858f16a4a
author jose <jose@tapadas.dev> 1710685680 +0000
committer jose <jose@tapadas.dev> 1710685680 +0000

initial commit
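
git-commit-tree needs to know who we are, and one way to tell it, without any config file, is through the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables. The sketch below reuses the identity and timestamp visible in the output above, in a throwaway directory, and shows that once the dates are pinned the same content always produces the same commit ID:

```shell
# Sketch: a commit ID depends only on the commit's content.
# Identity and timestamp are copied from the article's output;
# the temporary directory is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD

# commit-tree reads author and committer from these environment variables
export GIT_AUTHOR_NAME="jose" GIT_AUTHOR_EMAIL="jose@tapadas.dev"
export GIT_COMMITTER_NAME="jose" GIT_COMMITTER_EMAIL="jose@tapadas.dev"
# Pin both dates (unix timestamp plus offset) so nothing changes between runs
export GIT_AUTHOR_DATE="1710685680 +0000" GIT_COMMITTER_DATE="1710685680 +0000"

blob=$(echo "Mario" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$blob" itsame.txt
tree=$(git write-tree)

first=$(git commit-tree "$tree" -m "initial commit")
second=$(git commit-tree "$tree" -m "initial commit")

echo "first:  $first"
echo "second: $second"
git cat-file -p "$first"
```

Without the pinned dates the two IDs would normally differ as soon as the clock ticks over to the next second, since the timestamp is part of what gets hashed.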

We can see that the type of the object is in fact a commit, and we can see the tree that it refers to, information about the author and committer, as well as the message. Checking our status then:

$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   itsame.txt

Still nothing has changed. This is mostly because git has no idea where the commit is. It does know that HEAD is pointing to a ref representing our master branch, but it cannot make the connection between that ref and the commit object we have created. To fix that, let us write the commit hash into our master branch reference:

$ echo f81a29be9727635c552d3c16fc541fed104b3e24 > .git/refs/heads/master
$ git status
On branch master
nothing to commit, working tree clean

This is what lets git know that the state recorded in that commit is exactly where our HEAD is pointing, and looking into our log:

$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -> master)
Author: jose <jose@tapadas.dev>
Date:   Sun Mar 17 14:28:00 2024 +0000

    initial commit

We have then successfully created our base repository.
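
Writing the hash into the file with echo does the trick, but the plumbing command for this is git-update-ref, which goes through git's own ref handling instead of editing files behind its back. A minimal sketch in a throwaway directory; the identity variables reuse the name and email from the log above:

```shell
# Sketch: point a branch at a commit with update-ref instead of echo.
# The temporary directory is illustrative; name and email come from the article.
repo=$(mktemp -d)
cd "$repo" || exit 1
mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD

export GIT_AUTHOR_NAME="jose" GIT_AUTHOR_EMAIL="jose@tapadas.dev"
export GIT_COMMITTER_NAME="jose" GIT_COMMITTER_EMAIL="jose@tapadas.dev"

blob=$(echo "Mario" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$blob" itsame.txt
commit=$(git commit-tree "$(git write-tree)" -m "initial commit")

# Plumbing counterpart of: echo "$commit" > .git/refs/heads/master
git update-ref refs/heads/master "$commit"

echo "commit: $commit"
echo "master: $(git rev-parse refs/heads/master)"
echo "HEAD:   $(git rev-parse HEAD)"
```

Since HEAD points to refs/heads/master, moving the branch moves HEAD along with it, which is why all three lines should show the same ID.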

Shall we branch it?

So we have seen that a branch is nothing else than another reference to a commit, within .git/refs/heads. Let us then create a new branch without using the git-branch command:

$ cat .git/refs/heads/master > .git/refs/heads/feature
$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -> master, feature)
Author: jose <jose@tapadas.dev>
Date:   Sun Mar 17 14:28:00 2024 +0000

    initial commit

We can see from our log that HEAD is pointing to master, and that feature is at the same point in history. So checking out the feature branch should be as easy as updating our HEAD base reference:

$ echo "ref: refs/heads/feature" > .git/HEAD
$ git status
On branch feature
nothing to commit, working tree clean

As we can see we are now on that branch, as HEAD points to that reference.
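
These two steps, creating the branch and moving HEAD, also have plumbing counterparts: git-update-ref and git-symbolic-ref. A sketch in a throwaway directory (illustrative), with the identity taken from the article's own output:

```shell
# Sketch: create a branch and switch to it using only plumbing commands.
# The temporary directory is illustrative; name and email come from the article.
repo=$(mktemp -d)
cd "$repo" || exit 1
mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD

export GIT_AUTHOR_NAME="jose" GIT_AUTHOR_EMAIL="jose@tapadas.dev"
export GIT_COMMITTER_NAME="jose" GIT_COMMITTER_EMAIL="jose@tapadas.dev"

blob=$(echo "Mario" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$blob" itsame.txt
commit=$(git commit-tree "$(git write-tree)" -m "initial commit")
git update-ref refs/heads/master "$commit"

# Counterpart of: cat .git/refs/heads/master > .git/refs/heads/feature
git update-ref refs/heads/feature "$commit"
# Counterpart of: echo "ref: refs/heads/feature" > .git/HEAD
git symbolic-ref HEAD refs/heads/feature

current=$(git symbolic-ref --short HEAD)
echo "now on: $current"
```

Note that only HEAD changed here; no file in the working directory was touched, which is fine as long as both branches point to the same commit.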

Let us now create a file the same way we did before in order to observe how the divergence of paths is represented using this structure:

$ echo "Luigi" | git hash-object --stdin -w
cac2640e641943f5b71bd076589dce53923ad7bd

$ git cat-file -p cac2640e641943f5b71bd076589dce53923ad7bd > sidekick.txt

We then can add it to our index:

$ git update-index --add --cacheinfo 100644 cac2640e641943f5b71bd076589dce53923ad7bd sidekick.txt
$ git status
On branch feature
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   sidekick.txt

We are now able to trivially create a commit for this feature by first creating a tree:

$ git write-tree
4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
$ git cat-file -p 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd	itsame.txt
100644 blob cac2640e641943f5b71bd076589dce53923ad7bd	sidekick.txt

We can see above that we now have two blobs in this snapshot, so we just commit it, this time specifying our initial commit as the parent commit (-p):

$ git commit-tree 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17 -m "adding a friend for mario" -p f81a29be9727635c552d3c16fc541fed104b3e24
6c06460452f79dae525c00b7deb838eb5987d5dc

Looking at git-log won't be doing us any good if we don't update our branch ref to point to this new commit:

$ echo 6c06460452f79dae525c00b7deb838eb5987d5dc > .git/refs/heads/feature

And if we now look at the log, we can see that we have properly created a divergence between the two branches (feature is one commit ahead of master) that is being tracked correctly by git:

$ git log
commit 6c06460452f79dae525c00b7deb838eb5987d5dc (HEAD -> feature)
Author: jose <jose@tapadas.dev>
Date:   Sun Mar 17 14:46:39 2024 +0000

    adding a friend for mario

commit f81a29be9727635c552d3c16fc541fed104b3e24 (master)
Author: jose <jose@tapadas.dev>
Date:   Sun Mar 17 14:28:00 2024 +0000

    initial commit

We plumbed our way into the actual inner management that git does to track its objects via references.
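
To double check that the two commits are really chained, we can ask git a few plumbing-level questions: who is the parent of feature, how many commits are reachable from each branch, and whether master is an ancestor of feature. A self-contained sketch that replays the whole walkthrough in a throwaway directory (illustrative), using git-update-ref instead of echo to move the refs:

```shell
# Sketch: replay both commits, then inspect how the branches relate.
# The temporary directory is illustrative; name and email come from the article.
repo=$(mktemp -d)
cd "$repo" || exit 1
mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD

export GIT_AUTHOR_NAME="jose" GIT_AUTHOR_EMAIL="jose@tapadas.dev"
export GIT_COMMITTER_NAME="jose" GIT_COMMITTER_EMAIL="jose@tapadas.dev"

# First commit, on master
mario=$(echo "Mario" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$mario" itsame.txt
base=$(git commit-tree "$(git write-tree)" -m "initial commit")
git update-ref refs/heads/master "$base"

# Second commit, on feature, with the first one as its parent
luigi=$(echo "Luigi" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$luigi" sidekick.txt
tip=$(git commit-tree "$(git write-tree)" -p "$base" -m "adding a friend for mario")
git update-ref refs/heads/feature "$tip"

parent=$(git rev-parse "feature^")               # first parent of feature
on_master=$(git rev-list --count master)         # commits reachable from master
on_feature=$(git rev-list --count feature)       # commits reachable from feature
if git merge-base --is-ancestor master feature; then ancestor=yes; else ancestor=no; fi

echo "parent of feature: $parent"
echo "commits: master=$on_master feature=$on_feature, master is ancestor: $ancestor"
```

This is the same history the log shows, just read through the parent pointer that -p wrote into the second commit.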

Wrap up

In this small article we explored and expanded on our initial experience with git objects, and verified that a git repository is nothing more than a collection of objects and a system for referring to them via refs.