Creating a git repo from scratch: plumber style
Have you ever wanted to create a new git repo from scratch using git's low-level plumbing commands? Of course not, and that is why this small article will showcase how Mario would do it.
In a previous article we delved into the basics of git internals and concluded, in a succinct way, that we have the following main entities:
- blob: a binary representation of the contents of a file
- tree: a representation of a directory listing consisting of blobs and other trees
- commit: a snapshot of the working tree
And in this new entry we will look at how to create a repository, a commit and a branch from scratch, without using any of the fancy and user-friendly commands like git init, git add, git commit, etc., just because we can and because it is Sunday and the kid is asleep.
As a small remark, the failed pun regarding plumbers is actually a git concept that distinguishes between what are referred to as the porcelain commands (the user-friendly commands that interface with git internals) and the plumbing commands (the internal low-level commands). This is of course an even sadder pun on the actual works around a toilet: we are usually looking at the ceramic porcelain, but when we need to fix something deep down (omg), we have to go into the plumbing. For more context one may refer to: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain
So what is a repo anyway?
We can look at a usual project managed by git as containing the following 3 entities:
- a working directory on our filesystem with its inherent directory and file structure
- our repository that is basically a set of commit objects that hold the representation of our working directory at a certain point in time
- our index (commonly referred to as the "staging area"), which records the set of paths, and the blob each one maps to, that will make up the next commit. A bit like a draft of our next snapshot.
A git repository is then a collection of objects and a system to name and reference those objects, usually referred to as refs.
We intend, without using the benefits of our user-friendly common commands, to create a repo from scratch, using only plumbing commands, in order to grasp a bit more of the internals of git.
Let'sa go build the structure
Let us then go down the pipe and create a new directory to hold our new repository and ask git what it thinks about it:
$ mkdir gitrepo
$ cd gitrepo/
$ git status
fatal: not a git repository (or any of the parent directories): .git
So we have created our folder, but git is not looking at it as a repository (as expected). Here is where we would use git init, but that is out of reach for us today. As we mentioned before, a git repo is basically a set of commit objects and a set of references to them. You may have already noticed that a git repo contains a very specific folder named .git, so let us now create all of this structure that we're talking about:
$ mkdir .git
$ mkdir .git/objects
$ mkdir -p .git/refs/heads
$ tree .git/
.git/
├── objects
└── refs
    └── heads
4 directories, 0 files
So above we now have the usual structure: a folder to hold our git objects and a set of references. The heads folder will hold a special kind of reference that we usually refer to as branches. Based on this, can you already infer what the nature of a branch is in this mix? A branch is nothing more than a named reference to a specific commit.
Asking git about the state now that we created the base structure:
$ git status
fatal: not a git repository (or any of the parent directories): .git
Nope, it is still not happy. This is because git does not yet recognize the folder as a repository: besides objects and refs, it expects a base pointer, a base reference, named HEAD, that tells it where to look for the current commit (which we don't have yet). And what is HEAD? Just a simple file that points to a specific reference. Let us now create it:
$ echo "ref: refs/heads/master" > .git/HEAD
$ tree .git/
.git/
├── HEAD
├── objects
└── refs
    └── heads
So now we have created our HEAD reference, stating where it points to, and we can verify what git thinks about it:
$ git status
On branch master
No commits yet
nothing to commit (create/copy files and use "git add" to track)
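To make the point that there is no magic involved, the same skeleton can be sketched outside of the shell. The snippet below is only an illustration in Python: the function names make_skeleton and current_branch are made up for this article, and it assumes nothing more than the three pieces we created by hand (objects, refs/heads and a HEAD file).

```python
import tempfile
from pathlib import Path


def make_skeleton(root: Path, branch: str = "master") -> Path:
    """Create the bare layout described above under root/.git (hypothetical helper)."""
    git_dir = root / ".git"
    (git_dir / "objects").mkdir(parents=True)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    # HEAD is a plain text file holding a symbolic reference to a branch.
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{branch}\n")
    return git_dir


def current_branch(git_dir: Path) -> str:
    """Read HEAD and return the branch name it points to."""
    head = (git_dir / "HEAD").read_text().strip()
    prefix = "ref: refs/heads/"
    if not head.startswith(prefix):
        raise ValueError(f"HEAD is not a symbolic branch ref: {head!r}")
    return head[len(prefix):]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = make_skeleton(Path(tmp))
        # List everything we created, relative to .git
        for path in sorted(git_dir.rglob("*")):
            print(path.relative_to(git_dir).as_posix())
        print("on branch", current_branch(git_dir))
```

Note that the branch file itself (refs/heads/master) does not exist yet at this stage, which is exactly the "No commits yet" situation git reports.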
Yup, now git, looking into HEAD, knows where we are. Still, if we try to check our git log:
$ git log
fatal: your current branch 'master' does not have any commits yet
As expected, it just gives a fatal error, as we lack any kind of commit. As we mentioned before, the refs (branches included) are just named references to a kind of git object named commit, and a commit points to a tree, which in turn references blobs and other trees. So let us start at the bottom and write a new blob to our object database (.git/objects):
$ echo "Mario" | git hash-object --stdin -w
5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
We use git-hash-object (just man it for more), taking the content from stdin, and with -w we write the content to our objects list:
$ tree .git
.git
├── HEAD
├── objects
│   └── 5e
│       └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
└── refs
    └── heads
As we can see, the blob is now added to our .git/objects folder. The subfolder 5e is named after the first byte (the first two hex digits) of the object's SHA-1 hash, and the file name is the remaining 38 digits; this is just an optimization that keeps git from piling every object into one huge directory. We can verify the object by using the git-cat-file command with the returned hash:
$ git cat-file -t 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
blob
$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
Mario
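For the curious, here is a rough sketch of what hash-object -w does under the hood, written in Python rather than with git itself. It relies on two facts about loose objects: the hash is a SHA-1 over a small header (type, a space, the size, a NUL byte) followed by the content, and the file on disk is that same byte string compressed with zlib. The helper names (encode_object, object_id, write_loose, read_loose) are invented for this illustration.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def encode_object(kind: str, body: bytes) -> bytes:
    """Prefix the body with the object header: type, space, size, NUL byte."""
    return f"{kind} {len(body)}".encode() + b"\x00" + body


def object_id(kind: str, body: bytes) -> str:
    """SHA-1 of header + body, as a 40-character hex string."""
    return hashlib.sha1(encode_object(kind, body)).hexdigest()


def write_loose(objects_dir: Path, kind: str, body: bytes) -> str:
    """Store the object zlib-compressed under objects/<2 hex>/<38 hex>."""
    oid = object_id(kind, body)
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(encode_object(kind, body)))
    return oid


def read_loose(objects_dir: Path, oid: str) -> Tuple[str, bytes]:
    """Inverse of write_loose: return (type, body) for an object id."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = raw.partition(b"\x00")
    kind, _size = header.decode().split(" ")
    return kind, body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        objects = Path(tmp) / "objects"
        # Same bytes that `echo "Mario"` sends to stdin (note the trailing newline).
        oid = write_loose(objects, "blob", b"Mario\n")
        print(oid)
        print(read_loose(objects, oid))
```

If the sketch is faithful, the printed id should be the same 5e6d7d21... value that git hash-object returned, since the input bytes are identical. The type and contents read back correspond to what git cat-file -t and -p showed us.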
So we can verify that the object behind that hash is in fact a blob, and we can even see its contents. Let us check the status:
$ git status
On branch master
No commits yet
nothing to commit (create/copy files and use "git add" to track)
Yup, nothing there. No changes nor commits. This is because the blob only exists in the object database; no path has been registered for it in the index, which is where git tracks what goes into the next commit. Let us then use the command git-update-index to update the index with this new blob:
$ git update-index --add --cacheinfo 100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd itsame.txt
$ tree .git
.git
├── HEAD
├── index
├── objects
│   └── 5e
│       └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
└── refs
    └── heads
5 directories, 3 files
We can see that the command has now created a new file named .git/index (our staging area). In case you are wondering about that magic number 100644, it simply marks this blob as a normal file. If you're interested, the docs about objects state that:
In this case, you’re specifying a mode of 100644, which means it’s a normal file. Other options are 100755, which means it’s an executable file; and 120000, which specifies a symbolic link.
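These numbers are less magic than they look: they follow the Unix convention for file modes written in octal, where the leading digits encode the file type and the last three the permissions. As a small sketch (the describe function is just a name picked for this example), Python's stat module can decode the three modes quoted above:

```python
import stat


def describe(mode: str) -> str:
    """Decode an index mode string such as '100644' (hypothetical helper)."""
    value = int(mode, 8)  # the modes are written in octal
    if stat.S_ISLNK(value):
        return "symbolic link"
    if stat.S_ISREG(value):
        perms = stat.S_IMODE(value)  # permission bits only, e.g. 0o644
        return "executable file" if perms & stat.S_IXUSR else "regular file"
    return "other"


if __name__ == "__main__":
    for mode in ("100644", "100755", "120000"):
        print(mode, "->", describe(mode))
```

This only covers the three modes mentioned in the quote; anything else falls into the "other" bucket of this sketch.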
Let us check our status:
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: itsame.txt
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
deleted: itsame.txt
Now this is quite an interesting output:
- we see our itsame.txt file added to our index (what before this article we used to call the "staging area")
- we have the same file being marked as deleted
This is because the entry is recorded in the index but no such file exists in the working directory; due to this difference, git assumes the file was deleted. Let us then create the file, with the same content as the blob, using the git-cat-file command:
$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd > itsame.txt
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: itsame.txt
And now we have the proper change being tracked as something to be committed.
Building the commit
At this point we can create a snapshot of the contents of our project by creating a commit object. We won't, of course, be using the porcelain git-commit command.
As we mentioned above, a commit is an object that points to a tree, representing our project at a specific moment in time. Let us then start by writing a tree object from the current state of our index using the command git-write-tree:
$ git write-tree
88e009ef712773b75ffdfba27ea1a87858f16a4a
$ tree .git/
.git/
├── HEAD
├── index
├── objects
│   ├── 5e
│   │   └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
│   └── 88
│       └── e009ef712773b75ffdfba27ea1a87858f16a4a
└── refs
    └── heads
6 directories, 4 files
We can then see the object is created and added to our .git/objects structure. We can verify what it is as well:
$ git cat-file -t 88e009ef712773b75ffdfba27ea1a87858f16a4a
tree
$ git cat-file -p 88e009ef712773b75ffdfba27ea1a87858f16a4a
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd itsame.txt
From here we see that it is in fact... a tree, and it contains the mode, the type (a regular file blob), the hash and the name of the file that is present in our project. Again, please be reminded that a tree is nothing more than an object representing a directory by referencing both the blobs of its files and the trees of its subdirectories (please refer to the previous article about git object basics for a refresher if this is not clear by now).
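The pretty listing printed by cat-file -p is not how the tree is stored. As a sketch of the underlying format, each entry is the mode, a space, the name, a NUL byte and then the 20 raw bytes of the referenced hash, and the whole thing gets a "tree <size>" header before hashing. The Python below (with made-up names tree_body and tree_id) is a simplification that only handles a flat list of files; real trees with subdirectories follow a slightly different sort rule.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex object id)


def tree_body(entries: List[Entry]) -> bytes:
    """Serialize entries as: mode, space, name, NUL, 20 raw hash bytes."""
    body = b""
    # Simplification: plain byte-wise sort by name, fine for files only.
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode()):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return body


def tree_id(entries: List[Entry]) -> str:
    """SHA-1 over the 'tree <size>' header, a NUL byte and the body."""
    body = tree_body(entries)
    header = f"tree {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    entries = [("100644", "itsame.txt", "5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd")]
    # Can be compared with the id that `git write-tree` printed above.
    print(tree_id(entries))
```

Notice that the tree only stores the hash of the blob, never its content: two directories holding the same file end up pointing at the very same object.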
Now that we have a tree, we can proceed to create a commit object that points to it, using the command git-commit-tree:
$ git commit-tree 88e009ef712773b75ffdfba27ea1a87858f16a4a -m "initial commit"
f81a29be9727635c552d3c16fc541fed104b3e24
$ tree .git/
.git/
├── config
├── HEAD
├── index
├── objects
│   ├── 5e
│   │   └── 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
│   ├── 88
│   │   └── e009ef712773b75ffdfba27ea1a87858f16a4a
│   └── f8
│       └── 1a29be9727635c552d3c16fc541fed104b3e24
└── refs
    └── heads
7 directories, 6 files
We can see that the commit object is created and added to our .git/objects directory as well, like any other git object. Checking the content of this object, for the more curious or inquisitive:
$ git cat-file -t f81a29be9727635c552d3c16fc541fed104b3e24
commit
$ git cat-file -p f81a29be9727635c552d3c16fc541fed104b3e24
tree 88e009ef712773b75ffdfba27ea1a87858f16a4a
author jose <jose@tapadas.dev> 1710685680 +0000
committer jose <jose@tapadas.dev> 1710685680 +0000
initial commit
We can see there that the type of the object is in fact a commit; we can see the tree that it refers to, plus information about the author and committer, as well as the message. Checking then our status:
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: itsame.txt
Still nothing happened. This is mostly because git has no idea where the commit is. It does know that HEAD is pointing to some ref, representing our master branch, but it cannot make the connection between that and the actual commit object we have created. To do so, let us add that commit hash to our master branch reference:
$ echo f81a29be9727635c552d3c16fc541fed104b3e24 > .git/refs/heads/master
$ git status
On branch master
nothing to commit, working tree clean
This is what lets git see that our index and working directory are in the same state as the commit that HEAD now resolves to. Looking into our log:
$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -> master)
Author: jose <jose@tapadas.dev>
Date: Sun Mar 17 14:28:00 2024 +0000
initial commit
We have then successfully created our base repository.
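The chain git follows to answer "where am I?" is short enough to spell out. The sketch below, in Python and with an invented function name resolve_head, reads HEAD, follows the symbolic reference to the branch file and returns the commit id stored there. It is deliberately naive: it ignores packed refs and anything else a real repository may contain beyond the files we created by hand.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_head(git_dir: Path) -> Optional[str]:
    """Follow HEAD to a commit id, or return None when the branch has no commit."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref_file = git_dir / head[len("ref: "):]
        if not ref_file.exists():
            return None  # the "No commits yet" situation
        return ref_file.read_text().strip()
    return head  # HEAD may also hold a commit id directly (detached HEAD)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp) / ".git"
        (git_dir / "refs" / "heads").mkdir(parents=True)
        (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
        print(resolve_head(git_dir))  # branch file missing at this point

        # Same step as the echo into .git/refs/heads/master above.
        commit = "f81a29be9727635c552d3c16fc541fed104b3e24"
        (git_dir / "refs" / "heads" / "master").write_text(commit + "\n")
        print(resolve_head(git_dir))
```

This mirrors the two states we saw: before the echo, HEAD named a branch that did not exist as a file; after it, the lookup lands on our commit.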
Shall we branch it?
So we have seen that a branch is nothing more than another reference to a commit, stored within .git/refs/heads. Let us then create a new branch without using the git-branch command:
$ cat .git/refs/heads/master > .git/refs/heads/feature
$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -> master, feature)
Author: jose <jose@tapadas.dev>
Date: Sun Mar 17 14:28:00 2024 +0000
initial commit
We can see from our log that HEAD is pointing to master, and that feature is at the same point in history. So checking out the feature branch should be as easy as updating our HEAD base reference (which works here because both branches point to the same commit, so the index and working directory need no changes):
$ echo "ref: refs/heads/feature" > .git/HEAD
$ git status
On branch feature
nothing to commit, working tree clean
As we can see, we are now on that branch, as HEAD points to that reference.
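Since both operations are plain file writes, they fit in a few lines of Python. The names create_branch, switch_branch and list_branches below are made up for this sketch, and it makes the same shortcut we took in the shell: it never touches the index or the working directory, so it is only safe when the target branch points to the same commit. Git ships dedicated plumbing commands for these jobs (git update-ref and git symbolic-ref) that are more careful than writing the files directly.

```python
import tempfile
from pathlib import Path
from typing import Dict


def create_branch(git_dir: Path, name: str, start: str) -> None:
    """A new branch is a new file holding the same commit id as `start`."""
    heads = git_dir / "refs" / "heads"
    (heads / name).write_text((heads / start).read_text())


def switch_branch(git_dir: Path, name: str) -> None:
    """Point HEAD at another branch; index and working directory are untouched."""
    if not (git_dir / "refs" / "heads" / name).exists():
        raise ValueError(f"no such branch: {name}")
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{name}\n")


def list_branches(git_dir: Path) -> Dict[str, str]:
    """Map every branch name to the commit id stored in its file."""
    heads = git_dir / "refs" / "heads"
    return {p.name: p.read_text().strip() for p in sorted(heads.iterdir())}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp) / ".git"
        (git_dir / "refs" / "heads").mkdir(parents=True)
        (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
        commit = "f81a29be9727635c552d3c16fc541fed104b3e24"
        (git_dir / "refs" / "heads" / "master").write_text(commit + "\n")

        create_branch(git_dir, "feature", "master")
        switch_branch(git_dir, "feature")
        print(list_branches(git_dir))
        print((git_dir / "HEAD").read_text().strip())
```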
Let us now create a file the same way we did before in order to observe how the divergence of paths is represented using this structure:
$ echo "Luigi" | git hash-object --stdin -w
cac2640e641943f5b71bd076589dce53923ad7bd
$ git cat-file -p cac2640e641943f5b71bd076589dce53923ad7bd > sidekick.txt
We then can add it to our index:
$ git update-index --add --cacheinfo 100644 cac2640e641943f5b71bd076589dce53923ad7bd sidekick.txt
$ git status
On branch feature
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: sidekick.txt
We are now able to trivially create a commit for this feature by first creating a tree:
$ git write-tree
4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
$ git cat-file -p 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd itsame.txt
100644 blob cac2640e641943f5b71bd076589dce53923ad7bd sidekick.txt
There we can see that we now have two blobs in this snapshot in time, so we just commit it, but now specifying a parent commit (-p), namely our initial commit:
$ git commit-tree 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17 -m "adding a friend for mario" -p f81a29be9727635c552d3c16fc541fed104b3e24
6c06460452f79dae525c00b7deb838eb5987d5dc
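What the -p flag changes is easiest to see in the raw text of the commit object. The Python sketch below (commit_body and commit_id are names invented here) assembles that text in the order tree, parent lines, author, committer, blank line, message, and hashes it with a "commit <size>" header. The author and first timestamp are copied from the cat-file output earlier in the article; the second timestamp is derived from the Date line of git log and assumes the committer date equals the author date, so treat it as an illustration only.

```python
import hashlib
from typing import List, Optional


def commit_body(tree: str, message: str, author: str, when: str,
                parents: Optional[List[str]] = None) -> bytes:
    """Assemble the text that `git cat-file -p <commit>` prints."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in (parents or [])]
    lines += [f"author {author} {when}", f"committer {author} {when}"]
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def commit_id(body: bytes) -> str:
    """SHA-1 over the 'commit <size>' header, a NUL byte and the body."""
    header = f"commit {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    who = "jose <jose@tapadas.dev>"
    first = commit_body("88e009ef712773b75ffdfba27ea1a87858f16a4a",
                        "initial commit", who, "1710685680 +0000")
    # Can be compared with the id that `git commit-tree` printed for it.
    print(commit_id(first))

    # Assumed timestamp, derived from the log's Date line (see lead-in).
    child = commit_body("4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17",
                        "adding a friend for mario", who, "1710686799 +0000",
                        parents=["f81a29be9727635c552d3c16fc541fed104b3e24"])
    print(child.decode())
```

The only structural difference between the two commits is the parent line, and that single line is what turns a pile of snapshots into a history.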
Looking at git-log won't do us any good if we don't update our branch ref to point to this new commit:
$ echo 6c06460452f79dae525c00b7deb838eb5987d5dc > .git/refs/heads/feature
And if we now look at the log, we can see that we have properly created a divergence that is being tracked correctly by git:
$ git log
commit 6c06460452f79dae525c00b7deb838eb5987d5dc (HEAD -> feature)
Author: jose <jose@tapadas.dev>
Date: Sun Mar 17 14:46:39 2024 +0000
adding a friend for mario
commit f81a29be9727635c552d3c16fc541fed104b3e24 (master)
Author: jose <jose@tapadas.dev>
Date: Sun Mar 17 14:28:00 2024 +0000
initial commit
We plumbed our way into the actual inner management that git does to track its objects via references.
Wrap up
After this small article we have explored and expanded on our initial experience with git objects, and verified that a git repository is nothing more than a collection of objects and a system for referring to them via refs.