<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[josé tapadas alves]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://jose.tapadas.dev/</link><image><url>https://jose.tapadas.dev/favicon.png</url><title>josé tapadas alves</title><link>https://jose.tapadas.dev/</link></image><generator>Ghost 5.79</generator><lastBuildDate>Wed, 15 Apr 2026 20:16:35 GMT</lastBuildDate><atom:link href="https://jose.tapadas.dev/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[How to sign PDFs in (Ubuntu) Linux]]></title><description><![CDATA[<p>So I had to sign a PDF on my Linux box; here is a quick guide on how to do it. </p><ol><li>Install an open-source PDF editor; I opted for <a href="https://okular.kde.org/pt-pt/?ref=jose.tapadas.dev">Okular</a>.</li><li>Generate both a private key (<code>pdf_signing.key</code>) and a self-signed certificate (<code>pdf_signing.crt</code>) to use</li></ol>]]></description><link>https://jose.tapadas.dev/how-to-sign-pdfs-in-ubuntu-linux/</link><guid isPermaLink="false">6895c4c378c9d047a15b9546</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Fri, 08 Aug 2025 09:46:56 GMT</pubDate><content:encoded><![CDATA[<p>So I had to sign a PDF on my Linux box; here is a quick guide on how to do it. 
</p><ol><li>Install an open-source PDF editor; I opted for <a href="https://okular.kde.org/pt-pt/?ref=jose.tapadas.dev">Okular</a>.</li><li>Generate both a private key (<code>pdf_signing.key</code>) and a self-signed certificate (<code>pdf_signing.crt</code>) to use for signing the documents, using <code>openssl</code>:</li></ol><pre><code>openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes -keyout pdf_signing.key -out pdf_signing.crt -subj &quot;/CN=Jose Alves/emailAddress=jose@alves.lol&quot;</code></pre><ol start="3"><li>Convert it to PKCS#12:</li></ol><pre><code>openssl pkcs12 -export -in pdf_signing.crt -inkey pdf_signing.key -out signing-certificate.p12 -name &quot;Jose Alves&quot;</code></pre><ol start="4"><li>Import it into the NSS database (where Okular will look for it):</li></ol><pre><code>pk12util -d ~/.pki/nssdb -i signing-certificate.p12</code></pre><ol start="5"><li>Open Okular and choose <code>Tools &gt; Digitally Sign...</code></li></ol>]]></content:encoded></item><item><title><![CDATA[Creating a git repo from scratch: plumber style]]></title><description><![CDATA[Have you ever wanted to create a new git repo from scratch using git's low-level plumbing commands? 
Of course not, and that is why this small article will showcase how Mario would do it.]]></description><link>https://jose.tapadas.dev/creating-a-git-repo-from-scratch-plumber-style/</link><guid isPermaLink="false">65f6f14345341204b41b8723</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Sun, 17 Mar 2024 14:51:32 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1615465502839-71d5974f5087?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE2fHxzdXBlciUyMG1hcmlvfGVufDB8fHx8MTcxMDY4MjM5OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1615465502839-71d5974f5087?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE2fHxzdXBlciUyMG1hcmlvfGVufDB8fHx8MTcxMDY4MjM5OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Creating a git repo from scratch: plumber style"><p>So in a <a href="https://jose.tapadas.dev/git-object-basics-1/" rel="noreferrer">previous article we delved into the basics of git internals</a> and concluded, in a succinct way, that we have the following main entities:</p><ul><li><strong>blob: </strong>a binary representation of the contents of a file</li><li><strong>tree</strong>: a representation of a directory listing consisting of <code>blobs</code> and other <code>trees</code></li><li><strong>commit:</strong> a snapshot of the working tree</li></ul><p>And in this new entry we will look at how to create a repository, a commit and a branch, from scratch, without using any of the fancy and user-friendly commands like <code>git init</code>, <code>git add</code>, <code>git commit</code>, etc... 
just because we can, and because it is Sunday and the kid is asleep.</p><p>As a small remark, the failed pun regarding plumbers is actually a git concept that distinguishes between what are referred to as the <strong>porcelain</strong> commands (the user-friendly commands that interface with git internals) and the <strong>plumbing </strong>commands (the internal low-level commands). This is of course an even sadder pun on the actual work around a toilet: we usually only look at the ceramic porcelain, but when we need to deeply fix something (omg), we have to go into the plumbing. For more context one may refer to: <a href="https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain?ref=jose.tapadas.dev">https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain</a></p><h3 id="so-what-is-a-repo-anyway">So what is a repo anyway?</h3><p>We can look at a usual project managed by git as containing the following three entities:</p><ul><li>a <strong>working directory</strong> on our filesystem with its inherent directory and file structure</li><li>our <strong>repository</strong>, which is basically a set of <strong>commit</strong> objects that hold the representation of our <strong>working directory</strong> at certain points in time</li><li>our <strong>index </strong>(commonly referred to as the &quot;staging area&quot;), which is the materialization of the actual changed files at a specific point in time. 
A bit like our &quot;database of binaries&quot;.</li></ul><blockquote class="kg-blockquote-alt"><strong>A git repository is then a collection of objects and a system to name and reference those objects, usually referred to as refs.</strong></blockquote><p>We intend, without the benefits of our user-friendly common commands, to create a repo from scratch, using only plumbing commands, in order to grasp a bit more of the internals of git.</p><h3 id="letsa-go-build-the-structure">Let&apos;sa go build the structure</h3><p>Let us then go down the pipe and create a new directory to hold our new repository, then ask git what it thinks about it:</p><pre><code class="language-sh">$ mkdir gitrepo
$ cd gitrepo/
$ git status
fatal: not a git repository (or any of the parent directories): .git</code></pre><p>So we have created our folder, but git does not see it as a repository (as expected). This is where we would normally use <code>git init</code>, but that is out of reach for us today. As we mentioned before, a git repo is basically a set of commit objects and a set of references to them. You may have already noticed that a git repo contains a very specific folder named <code>.git</code>, so let us now create the structure we&apos;re talking about:</p><pre><code>$ mkdir .git
$ mkdir .git/objects
$ mkdir -p .git/refs/heads
$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; objects
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

4 directories, 0 files</code></pre><p>So we now have the usual structure: a folder to hold our git objects and a set of references. The <code>heads</code> folder will hold a special kind of reference that we usually refer to as <code>branches</code>. Based on this, can you already infer the nature of a <strong>branch</strong> in this mix? A <strong>branch</strong> is nothing more than a named reference to a specific commit.</p><p>Asking git about the state now that we have created the base structure:</p><pre><code>$ git status
fatal: not a git repository (or any of the parent directories): .git</code></pre><p>Nope, it is still not happy. This is because git does not know what to look for. By default it looks for a commit, any commit (which we don&apos;t have), via a base pointer, a reference named <code>HEAD</code>. And what is <code>HEAD</code>? Just a simple file that points to a specific reference. Let us now create it:</p><pre><code>$ echo &quot;ref: refs/heads/master&quot; &gt; .git/HEAD
$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; objects
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads</code></pre><p>So now we have created our <code>HEAD</code> reference, stating where it points to, and we can verify what git thinks about it:</p><pre><code>$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use &quot;git add&quot; to track)</code></pre><p>Yup: git, looking into <code>HEAD</code>, now knows where we are. Still, if we try to check our git log:</p><pre><code>$ git log
fatal: your current branch &apos;master&apos; does not have any commits yet</code></pre><p>As expected, it just gives a fatal error, as we lack any kind of commit. As we mentioned before, refs (branches included) are just named references to git objects called <code>commits</code>, and a <code>commit</code> points to a <code>tree</code> of <code>blobs</code>, so let us try to write a new binary <code>blob</code> into our object database:</p><pre><code>$ echo &quot;Mario&quot; | git hash-object --stdin -w
5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd</code></pre><p>We use <code>git-hash-object</code> (just <code>man</code> it for more), taking the content from <code>stdin</code>, and write (<code>-w</code>) it to our <code>objects</code> store:</p><pre><code>$ tree .git
.git
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads</code></pre><p>As we can see, the binary blob is now added to our <code>.git/objects</code> folder. The subfolder named after the first byte <code>5e</code> is just an optimization git uses to look up files based on the first byte of their SHA-1 hash value. We can verify the file by using the <code>git-cat-file</code> command on the returned hash:</p><pre><code>$ git cat-file -t 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
blob
$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
Mario
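# aside (not part of the original session): a blob id is just the sha1 of the
# string "blob ", the content size, a NUL byte and the contents, so we can
# reproduce the id above without git at all
$ printf 'blob 6\0Mario\n' | sha1sum
5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd  -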
</code></pre><p>So we can verify that this specific hashed object is in fact a <code>blob</code>, and we can even see its contents. Let us check the status:</p><pre><code>$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use &quot;git add&quot; to track)</code></pre><p>Yup, nothing there: no changes, no commits. This is because the blob was not added to the &quot;binary database&quot; of tracked changes that we called the <strong>index</strong> above. Let us then use the command <code>git-update-index</code> to update the index with this new <code>blob</code>:</p><pre><code>$ git update-index --add --cacheinfo 100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd itsame.txt

$ tree .git
.git
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; index
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

5 directories, 3 files
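# aside (not part of the original session): ls-files --stage is the plumbing
# way to dump what the index file now records
$ git ls-files --stage
100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd 0	itsame.txt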
</code></pre><p>We can see that the command has now created a new file named <code>.git/index</code> (our staging area). In case you are wondering about that magic number <code>100644</code>, it simply marks this blob as a normal file. If you&apos;re interested, <a href="https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects?ref=jose.tapadas.dev" rel="noreferrer">the docs about objects</a> state that: </p><blockquote>In this case, you&#x2019;re specifying a mode of&#xA0;<code>100644</code>, which means it&#x2019;s a normal file. Other options are&#xA0;<code>100755</code>, which means it&#x2019;s an executable file; and&#xA0;<code>120000</code>, which specifies a symbolic link.&#xA0;</blockquote><p>Let us check our status:</p><pre><code>$ git status
On branch master

No commits yet

Changes to be committed:
  (use &quot;git rm --cached &lt;file&gt;...&quot; to unstage)
	new file:   itsame.txt

Changes not staged for commit:
  (use &quot;git add/rm &lt;file&gt;...&quot; to update what will be committed)
  (use &quot;git restore &lt;file&gt;...&quot; to discard changes in working directory)
  deleted:    itsame.txt</code></pre><p>Now this is quite an interesting output:</p><ul><li>we see our <code>itsame.txt</code> file added to our <strong>index</strong> (which, before this article, we used to call the &quot;staging area&quot;)</li><li>we have the same file being marked as <strong>deleted</strong></li></ul><p>This is because our binary data is stored in the <strong>index</strong> but not on the actual filesystem, so git assumes the file was deleted at this point in time. Let us then create the file, with the same content as the blob, using the <code>git-cat-file</code> command:</p><pre><code>$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd &gt; itsame.txt
$ git status
On branch master

No commits yet

Changes to be committed:
  (use &quot;git rm --cached &lt;file&gt;...&quot; to unstage)
	new file:   itsame.txt</code></pre><p>And now we have the proper change being tracked as something to be committed.</p><h3 id="building-the-commit">Building the commit</h3><p>At this point we can create our snapshot of the contents of our project by creating a <strong>commit</strong> object. We won&apos;t, of course, be using the porcelain <code>git-commit</code> command.</p><p>As we mentioned above, a <strong>commit</strong> is an object that points to a <strong>tree</strong>, representing our project at a specific moment in time. Let us then start by writing a <code>tree</code> object with our current state using the command <code>git-write-tree</code>:</p><pre><code>$ git write-tree
88e009ef712773b75ffdfba27ea1a87858f16a4a
jose@syn:~/src/gitrepo$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; index
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 88
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; e009ef712773b75ffdfba27ea1a87858f16a4a
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

6 directories, 4 files</code></pre><p>We can see the object is created and added to our <code>.git/objects</code> structure. We can also verify what it is:</p><pre><code>$ git cat-file -t 88e009ef712773b75ffdfba27ea1a87858f16a4a
tree
$ git cat-file -p 88e009ef712773b75ffdfba27ea1a87858f16a4a
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd	itsame.txt</code></pre><p>From here we see that it is in fact... a <code>tree</code>, and it contains the type (regular file blob), the hash and the name of the file that is present in our project. Again, please be reminded that a <code>tree</code> is nothing more than an object representing our working directory by referencing both <code>blobs</code> for files and other <code>trees</code> for directories (please refer to the <a href="https://jose.tapadas.dev/git-object-basics-1/" rel="noreferrer">previous article about git object basics </a>for a refresh if this is not clear by now).</p><p>Now that we have a <strong>tree</strong> we can proceed to create a <strong>commit</strong> object that points to it, using the command <code>git-commit-tree</code>:</p><pre><code>$ git commit-tree 88e009ef712773b75ffdfba27ea1a87858f16a4a -m &quot;initial commit&quot;
f81a29be9727635c552d3c16fc541fed104b3e24

$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; config
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; index
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 88
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; e009ef712773b75ffdfba27ea1a87858f16a4a
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; f8
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; 1a29be9727635c552d3c16fc541fed104b3e24
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

7 directories, 6 files
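# aside (hypothetical check, not in the original session): count-objects
# tallies the loose objects stored so far; it should report "3 objects" here
# (the blob, the tree and this commit), the kilobyte count varies by filesystem
$ git count-objects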
</code></pre><p>We can see that the commit object is created and added to our <code>.git/objects</code> directory, like any other git object. Checking the content of this object, for the more curious or inquisitive:</p><pre><code>$ git cat-file -t f81a29be9727635c552d3c16fc541fed104b3e24
commit
$ git cat-file -p f81a29be9727635c552d3c16fc541fed104b3e24
tree 88e009ef712773b75ffdfba27ea1a87858f16a4a
author jose &lt;jose@tapadas.dev&gt; 1710685680 +0000
committer jose &lt;jose@tapadas.dev&gt; 1710685680 +0000

initial commit</code></pre><p>We can see there that the type of the object is in fact a <code>commit</code>; we can see the <code>tree</code> that it refers to, information about the committer, and the message. Checking our status:</p><pre><code>$ git status
On branch master

No commits yet

Changes to be committed:
  (use &quot;git rm --cached &lt;file&gt;...&quot; to unstage)
	new file:   itsame.txt</code></pre><p>Still nothing happened. This is because git has no idea where the commit is. It does know that <code>HEAD</code> points to some ref representing our <code>master</code> branch, but it cannot make the connection between that ref and the actual commit object we have created. To do so, let us add the commit hash to our <code>master</code> branch reference:</p><pre><code>$ echo f81a29be9727635c552d3c16fc541fed104b3e24 &gt; .git/refs/heads/master
$ git status
On branch master
nothing to commit, working tree clean</code></pre><p>This is what lets git know that the state of that commit is exactly where our <code>HEAD</code> is pointing. Looking into our log:</p><pre><code>$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -&gt; master)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:28:00 2024 +0000

    initial commit</code></pre><p>We have then successfully created our base repository.</p><h3 id="shall-we-branch-it">shall we branch it?</h3><p>So we have seen that a branch is nothing more than another reference to a commit, within <code>.git/refs/heads</code>. Let us then create a new branch without using the <code>git-branch</code> command:</p><pre><code>$ cat .git/refs/heads/master &gt; .git/refs/heads/feature
$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -&gt; master, feature)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:28:00 2024 +0000

    initial commit
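# aside (not part of the original session): both branch refs are now just
# small files holding the very same commit hash
$ cat .git/refs/heads/master .git/refs/heads/feature
f81a29be9727635c552d3c16fc541fed104b3e24
f81a29be9727635c552d3c16fc541fed104b3e24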
</code></pre><p>We can see from our log that <code>HEAD</code> is pointing to <code>master</code>, and that <code>feature</code> is at the same point in history. So <em>checking out </em>the <code>feature</code> branch should then be as easy as updating our <code>HEAD</code> base reference:</p><pre><code>$ echo &quot;ref: refs/heads/feature&quot; &gt; .git/HEAD
$ git status
On branch feature
nothing to commit, working tree clean</code></pre><p>As we can see we are now on that branch, as <code>HEAD</code> points to that reference.</p><p>Let us now create a file the same way we did before in order to observe how the divergence of paths is represented using this structure:</p><pre><code>$ echo &quot;Luigi&quot; | git hash-object --stdin -w
cac2640e641943f5b71bd076589dce53923ad7bd

$ git cat-file -p cac2640e641943f5b71bd076589dce53923ad7bd &gt; sidekick.txt</code></pre><p>We can then add it to our index:</p><pre><code>$ git update-index --add --cacheinfo 100644 cac2640e641943f5b71bd076589dce53923ad7bd sidekick.txt
$ git status
On branch feature
Changes to be committed:
  (use &quot;git restore --staged &lt;file&gt;...&quot; to unstage)
	new file:   sidekick.txt
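# aside (not part of the original session): the index now records both blobs
$ git ls-files --stage
100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd 0	itsame.txt
100644 cac2640e641943f5b71bd076589dce53923ad7bd 0	sidekick.txt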
</code></pre><p>We are now able to trivially create a commit for this feature by first creating a <code>tree</code>:</p><pre><code>$ git write-tree
4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
$ git cat-file -p 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd	itsame.txt
100644 blob cac2640e641943f5b71bd076589dce53923ad7bd	sidekick.txt</code></pre><p>Above we can see that we now have two <code>blobs</code> in this snapshot in time, so we just commit it, but now specifying our initial commit as the (<code>-p</code>) <strong>parent</strong> commit:</p><pre><code>$ git commit-tree 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17 -m &quot;adding a friend for mario&quot; -p f81a29be9727635c552d3c16fc541fed104b3e24
6c06460452f79dae525c00b7deb838eb5987d5dc</code></pre><p>Looking at <code>git-log</code> won&apos;t do us any good if we don&apos;t update our <strong>branch</strong> ref to point to this new commit:</p><pre><code>$ echo 6c06460452f79dae525c00b7deb838eb5987d5dc &gt; .git/refs/heads/feature</code></pre><p>And if we now look at the log we can see that we have properly created a divergence that is being tracked correctly by git:</p><pre><code>jose@syn:~/src/gitrepo$ git log
commit 6c06460452f79dae525c00b7deb838eb5987d5dc (HEAD -&gt; feature)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:46:39 2024 +0000

    adding a friend for mario

commit f81a29be9727635c552d3c16fc541fed104b3e24 (master)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:28:00 2024 +0000</code></pre><p>We plumbed our way into the actual inner management that git does to track its objects via references.</p><h3 id="wrap-up">wrap up</h3><p>In this small article we explored and expanded our initial experience with git objects, verifying how a git repository is nothing more than a collection of objects and a system of referring to them via refs.</p>]]></content:encoded></item><item><title><![CDATA[Git object basics I]]></title><description><![CDATA[Let us think about Git in a simplified manner, namely looking at it as if we're just maintaining a filesystem state. ]]></description><link>https://jose.tapadas.dev/git-object-basics-1/</link><guid isPermaLink="false">65ea3edc45341204b41b86f9</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Thu, 07 Mar 2024 22:35:25 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1516918842892-1c43ea4ad867?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHNuYXBzaG90fGVufDB8fHx8MTcwOTg1MDg5Nnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1516918842892-1c43ea4ad867?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHNuYXBzaG90fGVufDB8fHx8MTcwOTg1MDg5Nnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Git object basics I"><p>Let us think about Git in a simplified manner, namely looking at it as if we&apos;re just maintaining a filesystem state. 
Let us consider that we have a simple directory to track using git:</p><figure class="kg-card kg-image-card"><img src="https://jose.tapadas.dev/content/images/2024/03/image.png" class="kg-image" alt="Git object basics I" loading="lazy" width="628" height="406" srcset="https://jose.tapadas.dev/content/images/size/w600/2024/03/image.png 600w, https://jose.tapadas.dev/content/images/2024/03/image.png 628w"></figure><p>In terms of internal git objects we&apos;ll then expose three types that we can use to represent this structure:</p><ul><li><strong>BLOB</strong>: a binary representation of the contents of a file</li><li><strong>TREE</strong>: a representation of a directory listing of blobs and other trees</li><li><strong>COMMIT</strong>: a snapshot of the working tree</li></ul><p>In the following sections we&apos;ll briefly explore what these entities are.</p><h3 id="trees-%F0%9F%8C%B2">trees &#x1F332;</h3><p>A Git tree object represents a directory. It essentially functions as a description of the file system hierarchy at a particular moment in time. A tree object can reference other trees (subdirectories) and blobs (files), mimicking the structure of a file system directory. Tree objects are identified by their SHA-1 hash value, which is calculated based on:</p><ul><li>the contents of the tree object</li><li>the names, permissions, ... and subsequent SHA-1 hashed identifiers of the included blobs and other trees.</li></ul><p>For our repo we can check the contents of the tree as follows:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>And to construct the actual full structure of our repository, we can also check the <code>tree</code> of our <code>docs</code> directory by using the following syntax on the same reference:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD:docs
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	me.jpeg
</code></pre><p>But the same can be obtained by providing the <code>tree</code>&apos;s identifying SHA-1 hash:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree bb6e0e61969fafcc94435ace3fa644237667bacb
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	me.jpeg
</code></pre><p>With the above object representation we can then reconstruct the abstracted directory listing of both <code>trees</code> and <code>blobs</code> from our repo.</p><h3 id="blobs-%F0%9F%A9%BB">blobs &#x1FA7B;</h3><p>A git blob (binary large object) is a fundamental data structure in git. It stores the contents of a file in a repository at a given time, and once created it is immutable. Unlike a file in a filesystem, a blob does not contain any kind of metadata relevant to git, and it is identified by its own SHA-1 hash.</p><p>As we saw above, we can obtain the blob SHA-1 hashes of the files by recursing through their definitions in the <code>tree</code> objects. Let us obtain the list of blobs in our repo:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree -r -l HEAD
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239      16	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 1713408	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c      26	readme.md
</code></pre><p>So in order for us to actually check the binary representation of a file, as git abstracts it, we can then:</p><pre><code>jose@syn:~/src/gitint$ git cat-file -p fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c | hexdump -C
00000000  57 65 6c 63 6f 6d 65 20  74 6f 20 6d 79 20 73 61  |Welcome to my sa|
00000010  6d 70 6c 65 20 72 65 70  6f 0a                    |mple repo.|
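# aside (not part of the original session): hashing the working-tree file again
# reproduces the exact same blob id, since the id depends only on the contents
jose@syn:~/src/gitint$ git hash-object readme.md
fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c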
</code></pre><h3 id="commit-%F0%9F%93%B8">commit &#x1F4F8;</h3><p>A git commit is an object that represents a <em>snapshot</em> of the full working tree at a given point in time, along with its content. It contains two main sections:</p><ul><li>a pointer to the main <code>tree</code>, or our &quot;root directory&quot;</li><li>a metadata section that includes things like the commit author&apos;s name, the commit time, the commit message and usually references to one or more parent <code>commit</code> objects (or snapshots) of our structure at a given time.</li></ul><p>As expected by now, the <code>commit</code> object is also identified by its SHA-1 hash. This is usually what we see when we check the <code>git log</code>. One must note that a git commit includes the <strong>entire</strong> snapshot of the objects at that point in time, not only the diffs introduced by the committer.</p><p>Above we mentioned that <code>blobs</code> are immutable. This means that, upon a change to a certain file, and when making a new commit, only the modified file will get a different SHA-1 hash. For the ones kept the same (as the hash is not changing), nothing changes within the tree references. This is how, within a commit, git optimizes the materialized changes on the filesystem: only updating what in fact changed and keeping the blobs from parent commits that are not modified (whose SHA-1 hash is the same).</p><p>Let us recall the blob info of the <code>docs/bio.txt</code> file:</p><pre><code>jose@syn:~/src/gitint$ git cat-file -p 837c50063e6a1a55d8862632cb122da0d42b8239 | hexdump -C
00000000  48 45 4c 4c 4f 20 49 20  61 6d 20 6a 6f 73 65 0a  |HELLO I am jose.|
</code></pre><p>Our root <code>tree</code>:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>And our blobs within (recursively from the root):</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>We&apos;ll now modify the file and commit the changes:</p><pre><code>jose@syn:~/src/gitint$ echo &quot;My spoon is too big&quot; &gt;&gt; docs/bio.txt 
jose@syn:~/src/gitint$ git commit -am &quot;new bio version&quot;
[main d1f46f8] new bio version
 1 file changed, 1 insertion(+)
</code></pre><p>Now listing our <code>tree</code> we get a different hash:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree af1ef4ba103d5da65db58bedae6ffa151faa6017	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>So the updated <code>af1ef4ba103d5da65db58bedae6ffa151faa6017</code> is a different computed value from the one we got before the commit (<code>bb6e0e61969fafcc94435ace3fa644237667bacb</code>). This is of course because our blob for that specific file also changed:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 7c06157b32c6f934709ae6a95ec46118cf5c0f7d	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
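# aside (hypothetical check, not in the original session): diff-tree can
# compare the two docs trees directly and confirms only bio.txt got a new blob
jose@syn:~/src/gitint$ git diff-tree bb6e0e61969fafcc94435ace3fa644237667bacb af1ef4ba103d5da65db58bedae6ffa151faa6017
:100644 100644 837c50063e6a1a55d8862632cb122da0d42b8239 7c06157b32c6f934709ae6a95ec46118cf5c0f7d M	bio.txt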
</code></pre><p>But if we compare with the blob hashes before the commit, we verify that only the <code>docs/bio.txt</code> file has changed. This is how git optimizes file system storage: only updating, within each snapshot (commit), the objects that changed, and keeping the pointers the same for what is not touched.</p><p>Besides information about the blobs, the metadata of a <code>commit</code> also records the parent commit(s) it was based on. This is present on every commit. Let us consider our current git log:</p><pre><code>jose@syn:~/src/gitint$ git log
commit d1f46f80cea9e965ccc5c1d372461dd124ec1712 (HEAD -&gt; main)
Author: Jose Alves &lt;jose.tapadas@gmail.com&gt;
Date:   Thu Mar 7 22:05:58 2024 +0000

    new bio version

commit 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
Author: Jose Alves &lt;jose.tapadas@gmail.com&gt;
Date:   Thu Mar 7 18:12:37 2024 +0000

    initial commit for the git internals repo
</code></pre><p>We can see that the initial commit has a <code>nil</code> parent:</p><pre><code>jose@syn:~/src/gitint$ git show --format=&quot;%P&quot; 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55 -s
</code></pre><p>But our subsequent commit, currently at <code>HEAD</code>, showcases its parent(s):</p><pre><code>jose@syn:~/src/gitint$ git show --format=&quot;%P&quot; d1f46f80cea9e965ccc5c1d372461dd124ec1712 -s
8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
</code></pre><p>Two further trivial details worth noting:</p><ul><li>the exact same changes on a specific file, even when made by multiple authors, will still get the same blob and SHA-1 hash value</li><li>the commit hash, however, will always be different, as not only the blob info but also the metadata associated with the commit (author data, time, ...) contributes to generating a new hash</li></ul><h3 id="wrap-up">wrap up</h3><p>This was a simple exposure of the three basic root objects in the git representation of a filesystem evolving through time: each &quot;snapshot&quot; (commit) points to a linked structure of <code>trees</code> that reference other <code>trees</code> or the <code>blobs</code> for the files that are included, all identified by the SHA-1 hashes that trace every change made to them.</p>]]></content:encoded></item><item><title><![CDATA[Practical Asynchronous Iteration in JavaScript]]></title><description><![CDATA[<p>With the introduction of ES6, we acquired the support for synchronously iterating over data. 
We could, of course, already iterate over iterable<em> </em>built-in structures like objects or arrays, but the big introduction was the formalization of an implementable interface to create both our iterables<strong> </strong>and generators.</p><p>But what about the</p>]]></description><link>https://jose.tapadas.dev/practical-asynchronous-iteration-in-javascript/</link><guid isPermaLink="false">65d7644c8600432538ab255b</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Thu, 22 Feb 2024 15:18:37 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1513579068076-168caae41e27?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHN1cmYlMjBsaW5ldXB8ZW58MHx8fHwxNzA4NjE1MTk1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1513579068076-168caae41e27?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHN1cmYlMjBsaW5ldXB8ZW58MHx8fHwxNzA4NjE1MTk1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Practical Asynchronous Iteration in JavaScript"><p>With the introduction of ES6, we acquired the support for synchronously iterating over data. 
We could, of course, already iterate over iterable built-in structures like objects or arrays, but the big introduction was the formalization of an implementable interface to create both our iterables and generators.</p><p>But what about the scenarios where our iterations are done over data that is obtained from an asynchronous source, such as a set of remote HTTP calls or reading from a file?</p><p>In this article, we will conduct a practical analysis of the <a href="https://tc39.es/proposal-async-iteration/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">&#x201C;Asynchronous Iteration&#x201D;</a> proposal, which is intended to add:</p><blockquote>&#x201C;support for asynchronous iteration using the AsyncIterable and AsyncIterator protocols. It introduces a new IterationStatement, <strong>for-await-of</strong>, and adds syntax for creating async generator functions and methods.&#x201D;</blockquote><p>For this guide, only basic knowledge of JavaScript (or programming in general) is required. All our examples, which will be presented with some simple TypeScript annotations, can be found in <a href="https://github.com/josetapadas/async_iterators/blob/master/src/index.ts?ref=jose.tapadas.dev" rel="noopener ugc nofollow">this GitHub repository</a>.</p><h1 id="recap-of-synchronous-iterators-and-generators">Recap of Synchronous Iterators and Generators</h1><p>In this section, we will do a quick review of synchronous iterators and generators in JavaScript so we can more easily extrapolate them into the asynchronous case.</p><h2 id="what-is-an-iteration-anyway">What is an iteration anyway?</h2><p>Let us look into this question by considering what we usually have at hand when iterating. For example, on one side, we have our arrays, strings, etc., which are basically our sources. 
On the other side, we have what we usually use as our means to consume our data via an iteration &#x2014; namely our <code>for</code> loops, spread operators, etc.</p><p>Based on this, we can look at an iteration as a protocol that, when implemented by our sources, will allow consumers to sequentially &#x201C;consume&#x201D; its contents using a set of regular operations. This protocol could then be represented by the following interface:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*WOaQQRQL-peefmTnXyjUEQ.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="523"><figcaption><span style="white-space: pre-wrap;">1.1 Interface for defining a synchronous iterable, iterator, and subsequent result for every iteration.</span></figcaption></figure><p>So, putting it verbosely for those readers who may not be familiar with TS interface descriptions:</p><ul><li>A <code>SynchronousIterable</code> provides a method via a <code>Symbol.iterator</code> that would return a <code>SynchronousIterator</code>.</li><li>Our <code>SynchronousIterator</code> would then return our <code>IteratorResults</code> from its implementation of the <code>.next()</code> method.</li><li>The <code>IteratorResults</code> would then contain a value to hold the current iterated value as well as a <code>done</code> flag that is set to <code>true</code> after the last item is iterated through (and false while iterating).</li></ul><p><em>Note: You can find out more about this by reading the </em><a href="https://tc39.es/ecma262/?ref=jose.tapadas.dev#sec-iteration" rel="noopener ugc nofollow"><em>ECMAScript 2021 Language Specification documentation</em></a><em>.</em></p><p>An example of using this interface can be easily showcased by manually iterating over an array:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img 
src="https://miro.medium.com/v2/resize:fit:700/1*_83kTRsFEiQOppdpaLz1LQ.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="276"><figcaption><span style="white-space: pre-wrap;">1.2 Simple manual iteration. Notice the fact that &#x201C;done&#x201D; is set to true when we are over-transversing the object.</span></figcaption></figure><p>Sources that implement this interface can also be iterated via a <code>for..of</code> iteration directive that you have probably made use of at some point:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:462/1*WnctW7b1TOYDAiRQlklTsw.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="462" height="276"><figcaption><span style="white-space: pre-wrap;">1.3 Using for..of to iterate over our source.</span></figcaption></figure><p>Of course, we would expect sources like the built-in array (as used above) to be iterable naturally. To then showcase it differently, we&#x2019;ll implement that interface, for example, to generate a range of numbers:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*vSTuukAQ6tEcn4_LgtOgbA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="624"><figcaption><span style="white-space: pre-wrap;">1.4 A custom implementation of an iterable source that generates a range of numbers from &#x201C;start&#x201D; to &#x201C;end.&#x201D;</span></figcaption></figure><h2 id="what-about-generators">What about generators?</h2><p>Usually, functions return either a single value or none. We can think of generators<strong> </strong>as entities that can return, in sequence, multiple values. 
This handing over of values is what we call yielding.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:479/1*dwW78X2V61yXSTJknoDwTg.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="479" height="276"><figcaption><span style="white-space: pre-wrap;">1.5 Defining a generator function that yields, one at a time, the same 1, 2, 3 sequence.</span></figcaption></figure><p>These generator functions do not behave as regular functions, as they are lazily evaluated. When called, they return a generator object that is responsible for managing their execution. These generators are also iterable, meaning they also implement our interface, so we can loop over them just as above:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*4psE_Mw2FGeheZYa9DL9Xg.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="320"><figcaption><span style="white-space: pre-wrap;">1.6. Checking the values from our generator function. And yep, the main method of a generator is also .next().</span></figcaption></figure><p>It is also important to note that the <code>.next()</code> function is how we obtain the next yielded value from these generator objects. 
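</p><p>Mirroring Figures 1.5 and 1.6 above in copyable form (a minimal sketch; the generator name is illustrative, not from the article&#x2019;s repo):</p>

```typescript
// A generator function: lazily yields the sequence 1, 2, 3.
function* oneTwoThree(): Generator<number> {
  yield 1;
  yield 2;
  yield 3;
}

// Calling it returns a generator object; .next() drives its execution.
const gen = oneTwoThree();
console.log(gen.next()); // { value: 1, done: false }
console.log(gen.next()); // { value: 2, done: false }
console.log(gen.next()); // { value: 3, done: false }
console.log(gen.next().done); // true
```

<p>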
This will then produce the expected outputs.</p><p>As we can now yield values (instead of managing the iteration state manually), we can re-implement our iterable range from Figure 1.4, but now using a generator function:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*RWvfD7qO8vfwQmxsU07a-Q.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="424"><figcaption><span style="white-space: pre-wrap;">1.7 Our ranged iterable source now implemented using a generator function.</span></figcaption></figure><p>As we can see, we can now leverage a generator function to simplify our original ranged iterable.</p><h1 id="an-asynchronous-interface-proposal">An Asynchronous Interface Proposal</h1><p>After grasping the idea behind the interface definition of an iterable (Figure 1.1), it is easy to extend it in a way where every step of our iteration returns the result of an asynchronous operation. The usual representation of this is via promises:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*RAoZvdJOCjHVOh43dfaoWA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="478"><figcaption><span style="white-space: pre-wrap;">2.1. An interface for asynchronous iterables.</span></figcaption></figure><p>From the definition above, we can easily identify that the asynchronous operation happens when <code>.next()</code> provides the next element of the iteration. Therefore, it is trivial to implement it in a way that handles the results as promises. 
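</p><p>As a minimal text sketch of such an implementation (the <code>delay</code> helper and all names here are illustrative assumptions, not the article&#x2019;s exact code):</p>

```typescript
// Illustrative helper: a promise that resolves after ms milliseconds.
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// An asynchronous range: each .next() returns a promise of the next result.
function asyncRange(start: number, end: number): AsyncIterable<number> {
  return {
    [Symbol.asyncIterator]() {
      let current = start;
      return {
        async next(): Promise<IteratorResult<number>> {
          await delay(5); // faux asynchronous work
          return current <= end
            ? { value: current++, done: false }
            : { value: undefined, done: true };
        },
      };
    },
  };
}

// Consumed with for await..of:
async function collectRange(): Promise<number[]> {
  const seen: number[] = [];
  for await (const n of asyncRange(1, 3)) {
    seen.push(n);
  }
  return seen; // resolves to [1, 2, 3]
}
```

<p>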
Let us make this clear by adapting our ranged iterator (from Figure 1.4) in this way with a faux delay:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*MGFEVEVqDcmYEVSCsqX3Ag.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="649"><figcaption><span style="white-space: pre-wrap;">2.2 An asynchronous ranged iteration with for&#x2026;of.</span></figcaption></figure><p>We can use this concept to abstract our generator in (1.7). It would then be trivial to implement the same range using an asynchronous generator:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*lhNDd7-SNXEbp2PiHG-t8A.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="529"><figcaption><span style="white-space: pre-wrap;">2.3 Implementation of an asynchronous generator</span></figcaption></figure><p>Regardless of how we generate the data, we may simply iterate over the elements as if our asynchronous source was yet another iterable. Actually, it is one, as long as it implements our interface.</p><p>To demonstrate this, the next section will use an asynchronous source (a Hacker News top stories feed) that we will then manipulate as we would any other iterable structure.</p><h1 id="practical-exercise-a-hackernews-iterable"><strong>Practical Exercise: a HackerNews Iterable</strong></h1><p>Based on the implementation of an asynchronous generator (2.3), it is now trivial to materialize a generator source for our posts. 
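</p><p>Generically, such a source just hides its fetches behind an async generator; a sketch with an injected <code>fetchItem</code> function standing in for any real HTTP call (all names here are illustrative assumptions):</p>

```typescript
// Yields up to `limit` items, fetching each one asynchronously.
async function* remoteItems<T>(
  ids: number[],
  fetchItem: (id: number) => Promise<T>,
  limit: number
): AsyncGenerator<T> {
  for (const id of ids.slice(0, limit)) {
    yield await fetchItem(id);
  }
}

// Usage with a stubbed fetcher (a real one would perform an HTTP request):
async function demo(): Promise<string[]> {
  const fakeFetch = async (id: number) => `story #${id}`;
  const titles: string[] = [];
  for await (const title of remoteItems([10, 20, 30, 40], fakeFetch, 2)) {
    titles.push(title);
  }
  return titles; // resolves to ["story #10", "story #20"]
}
```

<p>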
For this example, we will use the <a href="https://github.com/HackerNews/API?ref=jose.tapadas.dev" rel="noopener ugc nofollow">HN API</a>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*y9zjdE5ZsR9cqW6sFY0Z5A.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="546"><figcaption><span style="white-space: pre-wrap;">3.1 Async generator for the top stories on HN.</span></figcaption></figure><p>This code simply tucks the asynchronous logic away while still implementing the usual interface of an async iterator. For this example, we are limiting the number of entries to iterate over, which is optional. By yielding each iterated value, we can easily consume this source with the following simple implementation:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*EcTH3VFav1fCT4KGG3hrrA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="317"><figcaption><span style="white-space: pre-wrap;">3.2 Iterating our asynchronous news source.</span></figcaption></figure><p>As expected, this loops over and renders a list of stories from our source:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*XQ07YPoCO0GvmrDCVY9cPA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="312"><figcaption><span style="white-space: pre-wrap;">3.3 Iteration result from our HN generator.</span></figcaption></figure><p>We can now look at our data source and handle it as a simple data sequence, keeping the full asynchronous fetching and manipulation logic within our generator definition.</p><h1 id="summary">Summary</h1><p>Hopefully, this article showcases that it is trivial to observe and handle our asynchronous data sources as iterables by applying simple 
language formalities already available in the language specification.</p><p>By generalizing the synchronous case to also cover asynchronous generation, we can now iterate over any iterable source regardless of the nature of the data source &#x2014; as long as it implements our interface. Looking at our asynchronous data sources as iterables opens a creative potential for our ideas towards more idiomatic and eloquent codebases.</p><p>For all the code examples, please refer to this <a href="https://github.com/josetapadas/async_iterators?ref=jose.tapadas.dev" rel="noopener ugc nofollow">GitHub repo</a>.</p>]]></content:encoded></item><item><title><![CDATA[A gentle introduction to gradient descent thru linear regression]]></title><description><![CDATA[<p>To have a glimpse on the intricate nature of the reality that surrounds us, to understand the underlying relations between events or subjects, or even to assess the influence of a specific phenomenon on an arbitrary event, we must convolute reality into the information dimension attempting to leverage our human</p>]]></description><link>https://jose.tapadas.dev/a-gentle-introduction-to-gradient-descent-thru-linear-regression/</link><guid isPermaLink="false">65d76657c7a18807246fefe6</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Mon, 01 May 2023 15:21:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1524351543168-8e38787614e9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fHZvcnRleHxlbnwwfHx8fDE3MDg2MTUyODd8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1524351543168-8e38787614e9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fHZvcnRleHxlbnwwfHx8fDE3MDg2MTUyODd8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="A gentle introduction to gradient descent thru linear regression"><p>To have a glimpse on the intricate 
nature of the reality that surrounds us, to understand the underlying relations between events or subjects, or even to assess the influence of a specific phenomenon on an arbitrary event, we must convolute reality into the information dimension, attempting to leverage our human abstractions to somehow grasp some insights on the nature of what actually surrounds us.</p><p>In this article we will briefly analyze one simple statistical tool that allows us to model a slice of reality: based on a set of <strong>Observations</strong>, we leverage a set of <strong>Variables</strong> to generate a model that predicts or forecasts a behavior by inferring the relations and mutual influence of those variables. This statistical tool is called <strong>Linear Regression</strong>.</p><p>To help minimize the errors associated with the prediction model, optimizing it to better represent reality, we will also briefly show a simple application of the <strong>Gradient Descent</strong> optimization algorithm.</p><p>This article assumes only basic algebra and calculus knowledge; both topics are simple but nevertheless represent two foundational subjects of modern statistics and machine learning.</p><h1 id="scenario">Scenario</h1><p>Let us consider a simple example by sampling some values from the <a href="https://stackoverflow.blog/2018/09/05/developer-salaries-in-2018-updating-the-stack-overflow-salary-calculator/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">&#x201C;Developers Salaries in 2018&#x201D;</a> article from StackOverflow. 
Below we can see a small figure representing the average salary points in Germany, and how they evolve as the developer experience advances in years:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*5G2eI7OduGRgI7u-m8Q40Q.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 1) Median yearly salaries for developers, in thousands of euros, by experience in Germany (2018)</span></figcaption></figure><p>Based on the above data points, we would like to develop a simple model function that would allow us to predict how the salaries evolve at any given amount of experience.</p><h2 id="linear-regression">Linear regression</h2><p>A linear function can then be defined by the simple expression:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:171/0*zrh3JctOBl2tIC8-" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="171" height="24"></figure><p>With the constant <em>m</em> representing the slope of the function line and <em>b</em> usually referred to as the intercept. 
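</p><p>As a tiny sketch of this expression in code (the slope and intercept values are illustrative):</p>

```typescript
// y = m * x + b
const linear = (m: number, b: number) => (x: number): number => m * x + b;

// Hypothetical model: salary = 3 * experience + 35
const predictSalary = linear(3, 35);
console.log(predictSalary(10)); // 65
```

<p>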
Some examples can be seen below showcasing different values of <em>(m, b)</em>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*p204GeqDZvirikOMS81JZw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 2) Three examples of slopes and intercepts for a linear function</span></figcaption></figure><p>In the example above we can see that changing <em>m</em> influences the slope of the resulting line, while the intercept <em>b</em> modifies the function value when crossing <em>x = 0</em>.</p><p>In our current scenario the salaries do not evolve in an exactly linear progression; nevertheless, looking at the data shape in Figure 1, it is acceptable to approximate the predicted salaries by <em>fitting a line</em> through the points in a way that:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:275/0*8s76g_PxeS565jvF" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="275" height="24"></figure><p>Finally, our model would be the line that approximates the joint evolution of our parameters, <strong>salary</strong> and <strong>experience</strong>, obtained by tweaking <em>m</em> and <em>b</em> so as to obtain something like:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*PlMy_FRzK0seVBqHRyi-0w.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 3) Possible linear model used to predict the median salaries</span></figcaption></figure><p>To this &#x201C;linear&#x201D; 
representation of a basic model capturing the relationship between our parameters we give the name <strong>Linear Regression</strong>.</p><h2 id="the-cost-of-our-errors">The Cost of our Errors</h2><p>Now that we&#x2019;ve identified what shape we&#x2019;ll be using to generate our model, we must figure out which values of <strong><em>(m, b)</em></strong> best describe the evolution of our data.</p><p>But how can we choose the values of <em>m</em> and <em>b</em> that generate the line we are searching for? An approach would be to compute the errors between the values our model generates and the actual data we have in place.</p><p>A simplistic approach for representation purposes only could be:</p><ul><li>we know that a developer with around <em><strong>10</strong> years of experience</em> earns around <em><strong>72K</strong> euros of yearly salary</em> (1)</li><li>starting with an example slope and intercept of <em>(m, b) = (3, 35)</em></li></ul><p>For this specific data point of 10 years of experience <em>(x=10)</em>, a sample error <strong>E</strong> would be:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:258/0*2ouOeqCJnU4mSMWd" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="258" height="32"></figure><p>For all of our existing sample data points, we can then compute a total error that sums all the differences between our prediction and the real values, resulting in a function as such:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:306/0*eAY_u23T4a9j4PAf" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="306" height="65"></figure><p>with:</p><ul><li><strong><em>n</em></strong> being the total number of samples in our data set</li><li><strong><em>y</em></strong> being the 
actual salary value for a specific observation</li><li><strong><em>x</em></strong> being the number of experience years that we want to predict for <strong><em>y</em></strong></li></ul><p>To this sum of the squared differences between each observation and our prediction, which represents our <strong>Error Function</strong> (or <strong>Cost Function</strong>), we give the name <strong>Sum of Squared Errors (SSE)</strong>. In statistics, this squared error is very useful to assess the &#x201C;quality&#x201D; of our prediction values against real observations.</p><p>But how can we then proceed in finding the proper values for <em>m</em> and <em>b</em>? An intuitive approach, as we are now able to compute an <strong>Error Function</strong>, is to find the pair <em>(m, b)</em> that minimizes this function. If so, we can then clearly state that we have the prediction that produces the minimal error and therefore most closely represents reality.</p><p>Let us then choose two random values for <em>(m, b)</em>, compute the cost function and then change these values to try to find the minimum of our error function. 
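</p><p>The cost function above can be sketched in code as follows (the sample observations below are illustrative, not the article&#x2019;s exact data):</p>

```typescript
// Sum of Squared Errors: E(m, b) = sum over i of (y_i - (m * x_i + b))^2
type Observation = { x: number; y: number };

function sse(points: Observation[], m: number, b: number): number {
  return points.reduce((acc, p) => acc + (p.y - (m * p.x + b)) ** 2, 0);
}

// Illustrative observations (experience in years vs. salary in thousands):
const sample: Observation[] = [
  { x: 2, y: 40 },
  { x: 5, y: 48 },
  { x: 10, y: 62 },
];

console.log(sse(sample, 3, 0));  // a large cost with b = 0
console.log(sse(sample, 3, 30)); // a much lower cost with a better intercept
```

<p>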
Considering initially <em>(m, b) = (3, 0)</em> and our data points from Figure 1, we obtain the following graphical result:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*JiNr8jcC7D0phualjRC3xA.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 4) Initial prediction and errors for (m, b) = (3, 0)</span></figcaption></figure><p>From the above figure we can see:</p><ul><li>the green dots representing our <strong>observed data values</strong> for the salaries</li><li>the blue line representing our <strong>prediction model</strong> (<em>y = 3 * years of experience + 0</em>)</li><li>the dotted red lines representing our <strong>Error</strong> for the current parameters <em>(m, b)</em></li></ul><p>For this specific set of intercept and slope, let us now compute our accumulated cost for the existing observations:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:580/0*Is3SrR28HdvRoyoQ" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="580" height="47"></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:408/0*nqrg_Sd47BTrSCit" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="408" height="47"></figure><p>Let us now fix the slope value to 3 but increase our intercept to 20, <em>(m, b) = (3, 20)</em>. 
We&#x2019;ll obtain the following representation:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*yKgqRi4DRrlLNQLEAm-KAg.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 5) Prediction iteration 1, and errors for (m, b) = (3, 20)</span></figcaption></figure><p>With an associated cost of <strong><em>E = 232.5</em></strong>. We can clearly see that by updating our intercept, we have improved our prediction as the error has dropped dramatically. Let us now plot multiple scenarios for different values of the intercept:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*XmjRRjxz7rd0BFSLQEosyw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="521"><figcaption><span style="white-space: pre-wrap;">Figure 6) Computing the errors by varying the value of the intercept</span></figcaption></figure><p>As we can see from the figure above, as we increase the value of the intercept <em>b</em> we can also observe the changes in the cost function. 
In this specific example it is trivial to identify the pink line, with <em>(m, b) = (3, 30)</em>, as the most accurate prediction of our observed values, as it also has the lowest cost value.</p><p>By plotting the variation of the error cost-function, obtained by varying the <em>intercept</em> value, we are presented with the following figure:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*LDb9GkxOlEF6cnNzK0ekZQ.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 7) Evolution of the cost function when changing the intercept value</span></figcaption></figure><p>We can clearly see that, when varying the value of <em>b</em>, taking into account that our Error Function is convex, we are able to find a local minimum that represents the minimal error of our prediction model. In this simple demonstration it is clear that the intercept <em>b</em> that minimizes our error is somewhere between [30, 40]. Unfortunately, simply iterating with a pre-defined step in order to find this minimum is very expensive and time consuming.</p><p>But how can we then compute this minimum of our cost function more cleverly? We&#x2019;ll be using the <strong>Gradient Descent</strong> algorithm.</p><h2 id="gradient-descent">Gradient Descent</h2><p>Gradient descent is an iterative optimization algorithm that allows us to find the local minima of a specific function.</p><p>A very nice example to explain the logic behind this algorithm, and recurrent in the literature, is that of the <strong>blind alpinist</strong>. 
Let us imagine that a blind alpinist wants to climb to the exact top of the mountain, with the least number of steps possible:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*jfIsd1WuxidQ4qoBxvmIMQ.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 9) Sequence of steps a smart blind alpinist would take to climb a mountain</span></figcaption></figure><p>As the alpinist is blind, he will assess the inclination at his current position in order to choose the magnitude of the next step he should take:</p><ul><li>if the inclination (slope) of the mountain at his current position is high, he can safely take a big step <em>(as we can notice for example on the transition from the 1st step to the 2nd)</em></li><li>when the slope is getting smaller, as he reaches the top of the mountain, he knows he needs to take smaller steps in order to get approximately to the exact highest point (<em>as the alpinist is getting closer to the top, from the 6th to the 7th step he is more careful on how to advance his position)</em></li><li>for positive slopes he needs to keep on going upwards to the top</li><li>for negative slopes, he is going down, so he needs to step back towards the top</li></ul><p>The slope of the &#x201C;mountain&#x201D; at a given point is then given by the derivative of the function at that point:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*qcq33VVWaDwzLz57pYjLpg.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 10) Slope of the function at a given point</span></figcaption></figure><p>Therefore, by computing the derivative of our 
&#x201C;mountain-function&#x201D; at a certain point we can then infer the nature of the step we&#x2019;ll need in order to properly reach our target, stopping once we reach a slope close to 0 (the yellow slope at the top, compared to the blue slope value at the beginning of the mountain).</p><p>It is also trivial to understand that the same is valid for the convex version of this function, by switching the alpinist challenge from reaching the top of a mountain to actually reaching the bottom of a valley:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*ahWt8YqQnsAOtBmiRswGZw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 11) Iterative finding of our local minimum for a convex function (or a valley in this alpine example)</span></figcaption></figure><p>Going back to our case study, and taking into account that we want to properly estimate the values of the slope and the intercept that minimize the Cost, we can then use this concept to minimize the convex function that is in fact our Cost Function.</p><p>For simplicity, let us initially just try to predict the actual value of the intercept by still keeping our slope fixed at <em>m = 3</em>. 
Our cost function would then be:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:624/0*SGTZhI6F84QChNwG" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="624" height="120"></figure><p>As we now know the equation of this curve, we can take its derivative and determine its slope at any value of the <em>intercept</em>.</p><p>Let us now compute the derivative of our cost function with respect to the intercept, using the chain rule:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:625/0*XOKmg3F0icFnX_gz" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="625" height="157"></figure><p>Now that we have properly computed the derivative, we can use gradient descent to find where our Cost Function has its local minimum.</p><p>It would indeed be trivial to compute this specific minimum by finding the place where the derivative (slope) is <strong><em>dE(b)/db = 0</em></strong>. Nevertheless, solving this analytically is not possible in many computational problems. Therefore we will apply Gradient Descent to, starting from an initial guess, iteratively approach this minimum. 
This ability to proceed when we cannot solve for the minimum directly is in fact what makes this optimization algorithm so useful in so many contexts, such as modern machine learning problems.</p><h2 id="learning-the-proper-value">Learning the proper value</h2><p>Now that we have our derivative function, let us first compute the slope for a random value of the intercept <strong><em>b</em></strong>, such as:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:135/0*v876htta1Dp3ai4e" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="135" height="58"></figure><p>With this we know that, when the intercept is 0, the slope of the tangent line at this point on our cost function is <strong>-69</strong>. As we approach the minimum of the function, this slope will also get close to 0.</p><p>From our alpine example we understood that the size of the step we take should somehow be related to the slope at a given point. This has the <strong>objective of taking &#x201C;bigger&#x201D; steps when the slope is higher and we are far from the minimum, and taking &#x201C;smaller&#x201D; steps when we are getting closer to a null slope</strong>.</p><p>As we are doing this process iteratively, let us adjust the step size on each iteration by scaling the slope with a constant. This constant, which we use to adapt the step size, is called the <strong>learning rate</strong>. 
With this idea in mind we can define the following expression to generate and adapt our step size on every iteration:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:413/0*MX-N6rz4bALaR_C_" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="413" height="29"></figure><p>Assuming a <strong>learning rate</strong> of 0.2, we obtain the following Step Size:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:359/0*pTxYjDtsLkLf1pNU" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="359" height="29"></figure><p>Taking into account our new step size, we can then compute the next iteration&#x2019;s <em>intercept</em> by subtracting the step size from the current one:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:273/0*fk5ZGBI9HLfVzUoq" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="273" height="29"></figure><p>So for our first iteration we have:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:254/0*xXFSJ3KpooWggllI" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="254" height="29"></figure><p>For this new intercept value we can see that the slope of the error function is then given by:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:161/0*EJSq0psKUE-wCNpQ" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="161" height="50"></figure><p>As the slope is closer to 0 we can see that we moved closer to the optimal value after just one iteration. 
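</p><p>The iteration scheme just described (step size = slope times the learning rate; new intercept = current intercept minus the step size) can be sketched in a few lines of Python. Note that the derivative function below is illustrative: it was chosen to reproduce the slopes quoted above (-69 at b = 0, -41.4 at b = 13.8), and is not taken from the original script.</p>

```python
# Illustrative derivative of the cost function with respect to the
# intercept b; chosen to match the slopes quoted in the text
# (dE(0)/db = -69, dE(13.8)/db = -41.4), not the author's original script.
def dEdb(b):
    return 2 * b - 69

def descend_intercept(b=0.0, learning_rate=0.2, min_step=0.001, max_iterations=1000):
    for _ in range(max_iterations):
        step = dEdb(b) * learning_rate   # step size = slope * learning rate
        if abs(step) < min_step:         # stop once the steps become tiny
            break
        b = b - step                     # move against the slope
    return b

b = descend_intercept()  # approaches 34.5, where this illustrative dE/db is 0
```

<p>With a clean update rule like this, the iterations keep shrinking the step until it falls below a chosen threshold.</p><p>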
By revisiting figure 6 we can infer that by increasing the intercept from 0 to a bigger value we are indeed reducing the residual error between our estimates and the actual observations.</p><p>Doing a couple more iterations we obtain:</p><p><strong>Step Size(2)</strong> = -41.4 * 0.2 = -8.28<br><strong>b(2)</strong> = 13.8 - (-8.28) = 22.08<br><strong>dE(22.08)/db</strong> = -24.8<br><strong>Step Size(3)</strong> = -24.8 * 0.2 = -4.96<br><strong>b(3) </strong>= 22.08 - (-4.96) = 27.04<br><strong>dE(27.04)/db</strong> = -14.8<br><strong>Step Size(4)</strong> = -14.8 * 0.2 = -2.96<br><strong>b(4)</strong> = 27.04 - (-2.96) = 30<br><strong>dE(30)/db</strong> = -9</p><p>We can verify the following from these 3 iterations:</p><ul><li>with every step we reach a smaller absolute slope</li><li>as we approach a 0 slope we take smaller steps, while keeping the same learning rate</li></ul><p>And visually we can see that on each iteration we get closer to a good prediction line for our data-set, and the steps are actually getting smaller:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*35pTtGAPX4tukA8eg3OvWw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="508"><figcaption><span style="white-space: pre-wrap;">Figure 13) Applying the iteration values from the gradient descent for the intercept</span></figcaption></figure><p>In order to stop the iterations at an acceptable value, one should:</p><ul><li>decide on a minimal step size per iteration, e.g., stop if the step size is smaller than 0.001</li><li>stop when we reach a certain number of iterations</li></ul><p>By applying these rules we can verify that the algorithm stops a few iterations later:</p><p><strong>[+] Iteration 5:</strong><br><strong>Step Size</strong> = -0.592<br><strong>b = </strong>30.592<br><strong>dE/db</strong> = -7.8160000000000025<br><strong>[+] 
Iteration 6:</strong><br><strong>Step Size</strong> = -0.1184<br><strong>b = </strong>30.7104<br><strong>dE/db</strong> = -7.579200000000007<br><em>(...)</em><br><strong>[+] Iteration 9:</strong><br><strong>Step Size</strong> = -0.0009472000000000001<br><strong>b = </strong>30.7397632<br><strong>dE/db</strong> = -7.5204736000000025</p><p>Stabilizing with an intercept of <strong><em>b = 30.7397632</em></strong>. When plotted we obtain:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*GQofWpfbV9a_ULIiH-glyw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 14) Using the stabilized predicted value given for a fixed slope to our intercept</span></figcaption></figure><p>With this approach we could verify that, by iterating progressively (adapting the step size based on a learning rate), we could indeed approach the minimum of the error cost function, obtaining, as plotted, a much closer model representation of the data. This was done by simply trying to predict one of the parameters, the intercept. 
In the following section we will then try to understand the evolution of this model with both our variables.</p><h2 id="moving-into-the-new-dimension">Moving into the new dimension</h2><p>Now that we have learned how to estimate the intercept value for our model, let us move a step outside our one dimension and apply gradient descent to both the <strong>intercept</strong> and the <strong>slope</strong>.</p><p>First, as in the previous section, we compute the partial derivative of our cost function with respect to our <strong>intercept</strong>, using the chain rule:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:646/0*CLldvN0DtJemPWTj" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="646" height="163"></figure><p>Now we can proceed to find the partial derivative of our Error function with respect to the slope:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:647/0*wjkpXBFceA9cw07f" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="647" height="240"></figure><p>The set of partial derivatives with respect to all the dimensions of this function is called the <strong>Gradient:</strong></p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:211/0*tNQPg2LZvV9zMJ6y" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="211" height="60"></figure><p>We will then use this gradient, just as in the previous section, to find the local minimum of our error function. This is the reason behind calling this algorithm <strong>Gradient Descent</strong>.</p><p>In order to do so, we need to extrapolate what we did for the intercept in the previous section to predict both values, adjusting for their inter-dependency. 
Exposing it as such:</p><ul><li>As we are now handling two variables, the problem could again be compared to climbing a mountain, but with an extra dimension of complexity: you would need to adapt the pace of your feet on the wall, but also a distinct pace for your hand grips. Therefore, keeping the <strong>same learning rate</strong>, we would need to adapt two step sizes, one for the slope and another for the intercept:</li></ul><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:405/0*ZfWiouTAKKwQDT3T" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="405" height="91"></figure><ul><li>With the new step sizes we can then obtain the current prediction for both variables, per iteration <em>n</em>:</li></ul><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:511/0*7lWXbahpsGR9rHdZ" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="511" height="63"></figure><ul><li>We then compute the gradient again (namely the derivative for both variables, with the updated values):</li></ul><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:453/0*IsJoGuHzK1FahqF2" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="453" height="63"></figure><ul><li>Repeat the whole process until we reach a chosen limit for the iterative process. We will keep on using a limit on the step size.</li></ul><p>In order to implement this small algorithm we also need to pick all the initial values. A proper stopping limit should also be decided. 
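</p><p>The loop in the steps above can be sketched in plain Python, using the partial derivatives of the mean squared error cost derived earlier. The helper names and the toy data below are illustrative, not the ones from the author&#x2019;s own script:</p>

```python
# Sketch of gradient descent over both parameters of y = m*x + b, using the
# partial derivatives of the mean squared error cost
# E(m, b) = (1/n) * sum((y - (m*x + b))**2). Names and data are illustrative.
def gradient(m, b, xs, ys):
    n = len(xs)
    dm = (-2.0 / n) * sum(x * (y - (m * x + b)) for x, y in zip(xs, ys))
    db = (-2.0 / n) * sum(y - (m * x + b) for x, y in zip(xs, ys))
    return dm, db

def gradient_descent(xs, ys, m, b, learning_rate, min_step, max_iterations=200000):
    for _ in range(max_iterations):
        dm, db = gradient(m, b, xs, ys)
        step_m = dm * learning_rate
        step_b = db * learning_rate
        if max(abs(step_m), abs(step_b)) < min_step:   # stopping criterion
            break
        m, b = m - step_m, b - step_b                  # update both at once
    return m, b

# Toy data lying exactly on y = 2x + 5; the fit should approach m = 2, b = 5
xs, ys = [1, 2, 3, 4], [7, 9, 11, 13]
m, b = gradient_descent(xs, ys, m=3, b=30, learning_rate=0.001, min_step=0.00001)
```

<p>On exact data lying on a line, the fit approaches the true slope and intercept; on noisy data it approaches the least-squares estimates.</p><p>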
Let us, for our example, decide on:</p><ul><li>the initial <strong>intercept</strong> as <strong><em>b = 30</em></strong></li><li>the initial <strong>slope </strong>as <strong><em>m = 3</em></strong></li><li>our <strong>learning rate</strong> will be <strong><em>0.001</em></strong></li><li>and we will stop when the step size reaches <strong>0.00001</strong></li></ul><p>This can be represented by this simplistic python script:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/3cee9e9fab6833248c9ded5f8e68b4c5" allowfullscreen frameborder="0" height="722" width="680" title="gradient_descent_simple_regression.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Running this simple script we obtain the following output:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9108a4d4cbd90c8249fad94a12652bee" allowfullscreen frameborder="0" height="370" width="680" title="Simple regression result" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>We then obtain the following predicted values for both our variables:</p><ul><li><strong>Slope (m) = 2.5101</strong></li><li><strong>Intercept (b) = 40.5478</strong></li></ul><p>Applying these values to our prior plots containing the error spread we obtain the following:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*F8sT3KpwZgBvbXt1GzGfEw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 15) Using the stabilized predicted value, based on the computed slope and intercept, compared with the initial value</span></figcaption></figure><p>We can see from the figure above that our (linear) prediction is now much closer to the modeled data, and we can use our new model to actually infer about the relation 
between these two parameters.</p><h1 id="conclusion">Conclusion</h1><p>We can try to make an initial prediction about the natural relation of a set of parameters, and even obtain a simplistic model of their evolution, by using <strong>Linear Regression</strong>. In this article we used this simple numerical method as a way to clearly expose the basic functioning of the <strong>Gradient Descent</strong> algorithm and mainly how we <strong>can achieve an iterative optimization of a prediction by trying to minimize a convex error function</strong>.</p><p>Even though its underlying concepts may seem very simple, both conceptually and mathematically, they serve as one of the foundations of deep learning and neural networks.</p>]]></content:encoded></item><item><title><![CDATA[Predicting AirBnB prices in Lisbon: Trees and Random Forests]]></title><description><![CDATA[<p>In this small article, we will quickly bootstrap a prediction model for the nightly prices of an AirBnB in Lisbon. 
This guide hopes to serve as a simplistic and practical introduction to machine learning data analysis, by using real data and developing a real model.</p><p>It assumes as well a</p>]]></description><link>https://jose.tapadas.dev/predicting-airbnb-prices-in-lisbon-trees-and-random-forests/</link><guid isPermaLink="false">65d766d9c7a18807246feff7</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Mon, 16 Mar 2020 15:23:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1585208798174-6cedd86e019a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fGxpc2JvbnxlbnwwfHx8fDE3MDg2MTU0NjB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1585208798174-6cedd86e019a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fGxpc2JvbnxlbnwwfHx8fDE3MDg2MTU0NjB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests"><p>In this small article, we will quickly bootstrap a prediction model for the nightly prices of an AirBnB in Lisbon. This guide hopes to serve as a simplistic and practical introduction to machine learning data analysis, by using real data and developing a real model.</p><p>It assumes as well a basic understanding of Python and the machine learning library <a href="https://scikit-learn.org/stable/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">scikit-learn</a>, and it was written on a Jupyter notebook running Python 3.6 and sklearn 0.21. 
The dataset, as well as the notebook, can be obtained on my <a href="https://github.com/josetapadas/airbnb-lisbon-model-trees?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Github account</a>, or via <a href="https://datasetsearch.research.google.com/search?query=lisbon+airbnb&amp;docid=c6zMqHvIlOwEwlHEAAAAAA%3D%3D&amp;ref=jose.tapadas.dev" rel="noopener ugc nofollow">Google&#x2019;s dataset search</a>.</p><h1 id="1-data-exploration-and-cleanup">1. Data exploration and cleanup</h1><p>As the first step, we start by loading our dataset. After downloading the file it is trivial to open and parse it with Pandas and provide a quick list of what we could expect from it:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/0d23b5fb7550859fd4eb7ed5da2c734b" allowfullscreen frameborder="0" height="172" width="680" title="t1.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Index([&apos;room_id&apos;, &apos;survey_id&apos;, &apos;host_id&apos;, &apos;room_type&apos;, &apos;country&apos;, &apos;city&apos;, &apos;borough&apos;, &apos;neighborhood&apos;, &apos;reviews&apos;, &apos;overall_satisfaction&apos;, &apos;accommodates&apos;, &apos;bedrooms&apos;, &apos;bathrooms&apos;, &apos;price&apos;, &apos;minstay&apos;, &apos;name&apos;, &apos;last_modified&apos;, &apos;latitude&apos;, &apos;longitude&apos;, &apos;location&apos;], dtype=&apos;object&apos;)</p><p>Even though above we can confirm that the dataset was properly loaded and parsed, a quick analysis of the statistical description of the data may provide us with a quick insight of its nature:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/fb96abce656cf22635979ce398bf990f" allowfullscreen frameborder="0" height="62" width="680" title="t2.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:1000/1*0OhfqtAo2gAr70OCim3--w.png" 
class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="1000" height="204"></figure><p>From this table we can infer basic statistical observations for each of our parameters. As our model intends to predict the price, based on whatever set of inputs we&#x2019;ll provide to it, we could check for example that:</p><ul><li>the mean value of the nightly price is around <strong>88 EUR</strong></li><li>the prices range from a minimum of <strong>10 EUR</strong> to <strong>4203 EUR</strong></li><li>the standard deviation for the prices is around <strong>123 EUR</strong> (!)</li></ul><p>The price distribution could be represented as follows:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/f7ddf59491d98bb4f97f6fefa9d3c048" allowfullscreen frameborder="0" height="150" width="680" title="t3.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:388/0*OAtkBHlLPNbrNOc4.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="388" height="266"></figure><p>As we can see, our distribution of prices concentrates under the <strong>300 EUR</strong> mark, with a few entries reaching <strong>4000 EUR</strong>. 
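</p><p>A histogram like this can be sketched with pandas and matplotlib; the inline series below is a tiny illustrative stand-in for the real <code>price</code> column:</p>

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Tiny illustrative stand-in for the dataset's price column
prices = pd.Series([30, 45, 60, 75, 88, 95, 120, 150, 300, 4000], name="price")

ax = prices.plot.hist(bins=20)
ax.set_xlabel("price (EUR)")
plt.savefig("price_histogram.png")
```

<p>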
Plotting only the range where most of the prices reside:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/e0bbce7b7ccce9374024c14f5700aeff" allowfullscreen frameborder="0" height="84" width="680" title="t4.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:394/0*-kMjidTqk_TfPeYw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="394" height="266"></figure><p>We can clearly see from the representation above that most nights in Lisbon cost between <strong>0&#x2013;150 EUR</strong>.</p><p>Let us now have a sneak peek into the actual dataset, in order to understand the kind of parameters we&#x2019;ll be working with:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/523eb294aa48bf0aa489cee57f14c93b" allowfullscreen frameborder="0" height="62" width="680" title="t5.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:1000/1*sK_Ut2wu3EfDeCCgjh64Zw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="1000" height="298"></figure><p>From the description above, we should be able to infer some statistical observations about the nature of the data. 
Besides the distribution set of parameters (that we will not be looking at for now), we clearly identify several relevant insights:</p><ul><li>there are empty columns: <code>country</code>, <code>borough</code>, <code>bathrooms</code>, <code>minstay</code></li><li>entries like <code>host_id</code>, <code>survey_id</code>, <code>room_id</code>, <code>name</code>, <code>city</code> and <code>last_modified</code> may not be so relevant for our price predictor</li><li>there is some categorical data that we will not initially be able to add to the regression of the Price, such as <code>room_type</code> and <code>neighborhood</code> (but we&apos;ll be back to these two later on)</li><li><code>location</code> may be redundant for now, when we have both <code>latitude</code> and <code>longitude</code>, and we may need to further infer about the nature of the format of this field</li></ul><p>Let us then proceed to separate the dataset into:</p><ul><li>one vector <strong>Y</strong> that will contain all the real prices of the dataset</li><li>one matrix <strong>X</strong> that contains all the features that we consider relevant for our model</li></ul><p>This can be achieved by the following snippet:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/ea742e3aa4a0bb35c94e6a0e47747cbd" allowfullscreen frameborder="0" height="172" width="680" title="t6.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:517/1*EqjpaZ9D6qY9vvH3d5Qa-A.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="517" height="166"></figure><p>With our new subset, we can now try to understand the correlation of these parameters in terms of the overall satisfaction, for the most common price range:</p><figure class="kg-card kg-embed-card"><iframe 
src="https://towardsdatascience.com/media/470141fc6ed6c0e7a7fbde2603685cf5" allowfullscreen frameborder="0" height="106" width="680" title="t7.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/0*CR4yUy6V0brtrD78.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="636"></figure><p>The above plots allow us to check the distribution of all the single variables and try to infer the relationships between them. We&#x2019;ve taken the liberty of applying a color hue based on the review values for each of our chosen parameters. Some easy-to-read examples from the figure above, of relationships that may denote a positive correlation:</p><ul><li>reviews are more common for rooms that accommodate fewer guests. This could mean that most of the guests that review are renting smaller rooms.</li><li>most of the reviews are made for the cheaper priced rooms</li><li>taking into account the visual dominance of the yellow hue, most of the reviews are rated 5. Either most of the accommodations are actually very satisfactory or, more probably, the people who do review tend to give a 5 rating.</li></ul><p>One curious observation is also that the location heavily influences both price and rating. 
When plotting both longitude and latitude we obtain a quasi geographical/spatial distribution of the ratings across Lisbon:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9240414859e51074707056abacd8ce7b" allowfullscreen frameborder="0" height="62" width="680" title="t8.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:538/0*PyZJDKOLGKo8ezb3.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="538" height="424"></figure><p>We can then add this data to an actual map of Lisbon, to check the distribution:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:500/0*hl_sQAYpiOzQpBL1.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="500" height="517"></figure><p>As expected, most of the reviews are in the city center, with a relevant cluster of reviews alongside the recent Parque das Na&#xE7;&#xF5;es. In the northern, more suburban area, even though there are some scattered places, the reviews are neither as high nor as frequent as in the center.</p><h1 id="2-splitting-the-dataset">2. Splitting the dataset</h1><p>With our dataset now properly cleaned we will first proceed to split it into two pieces:</p><ul><li>a set that will be responsible for training our model, therefore called the <strong>training set</strong></li><li>a <strong>validation set</strong> that will be used to validate our model</li></ul><p>Both sets are basically subsets of <strong>X</strong> and <strong>Y</strong>, containing a subset of the rental spaces and their corresponding prices. After training our model, we use the validation set as an input to infer how well our model generalizes to data sets other than the one used to train. 
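</p><p>The split itself can be sketched with scikit-learn&#x2019;s <code>train_test_split</code>; the arrays below are illustrative stand-ins for the real <strong>X</strong> and <strong>Y</strong>, and a <code>test_size</code> of 0.25 matches the 10183/3395 split of this dataset:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the real feature matrix X and price vector Y
X = np.arange(40).reshape(20, 2)
Y = np.arange(20)

# Hold out 25% of the rows for validation; random_state makes the
# shuffle reproducible across runs
Xt, Xv, Yt, Yv = train_test_split(X, Y, test_size=0.25, random_state=42)

print(Xt.shape, Xv.shape)  # (15, 2) (5, 2)
```

<p>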
When a model is performing very well on the training set, but does not generalize well to other data, we say that the model is <strong>overfitted</strong> to the dataset.</p><p>For deeper information on overfitting, please refer to <a href="https://en.wikipedia.org/wiki/Overfitting?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://en.wikipedia.org/wiki/Overfitting</a></p><p>In order to avoid this <strong>overfitting</strong> of our model to the training data we will use a tool from sklearn called <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?ref=jose.tapadas.dev" rel="noopener ugc nofollow">train_test_split</a> that will split our data into random train and test subsets:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/4a74d9b347dd46c99eecedf689023068" allowfullscreen frameborder="0" height="194" width="680" title="t9.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Training set: Xt:(10183, 6) Yt:(10183,) <br>Validation set: Xv:(3395, 6) Yv:(3395,) <br>- <br>Full dataset: X:(13578, 6) Y:(13578,)</p><p>Now that we have our datasets in place, we can proceed to create a simple regression model that will try to predict, based on our chosen parameters, the nightly cost of an AirBnB in Lisbon.</p><h1 id="3-planting-the-decision-trees">3. Planting the Decision Trees</h1><p>As one of the most simplistic supervised ML models, a decision tree is usually used to predict an outcome by learning and inferring decision rules from all the available feature data. 
By ingesting our data parameters the trees can learn a series of educated &#x201C;questions&#x201D; in order to partition our data in a way that we can use the resulting data structure to either classify categorical data or simply create a regression model for numerical values (as it is our case with the prices).</p><p>A visualization example, taken from <a href="https://en.wikipedia.org/wiki/Decision_tree_learning?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Wikipedia</a>, could be the decision tree around the prediction for the survival of passengers in the Titanic:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:360/0*bpxepHwlE4EQaWiM.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="360" height="340"></figure><p>Based on the data, the tree is built on the root and will be (recursively) partitioned by splitting each node into two child ones. These resulting nodes will be split, based on decisions that are inferred about the statistical data we are providing to the model, until we reach a point where the data split results in the biggest information gain, meaning we can properly classify all the samples based on the classes we are iteratively creating. 
The end vertices are called &#x201C;leaves&#x201D;.</p><p>On the Wikipedia example above it is trivial to follow the decision process and, as the probability of survival is the estimated parameter here, we can easily obtain the probability that a &#x201C;male, older than 9.5 years&#x201D; survives when &#x201C;he has no siblings&#x201D;.</p><p>(For a deeper understanding of how decision trees are built for regression, I would recommend the <a href="https://statquest.org/2018/01/22/statquest-decision-trees/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Decision Trees video by StatQuest</a>.)</p><p>Let us then create our Decision Tree regressor by utilizing the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html?highlight=decisiontreeregressor&amp;ref=jose.tapadas.dev#sklearn.tree.DecisionTreeRegressor" rel="noopener ugc nofollow">sklearn implementation</a>:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/8395430d54d094863e7e7400443808c1" allowfullscreen frameborder="0" height="106" width="680" title="t10.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>DecisionTreeRegressor(criterion=&apos;mse&apos;, max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=42, splitter=&apos;best&apos;)</p><p>We can verify how the tree was built, for illustration purposes, on the picture below:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/91106e393458b5f0a34bf2010849695f" allowfullscreen frameborder="0" height="260" width="680" title="t11.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Please <a href="https://github.com/josetapadas/airbnb-lisbon-model-trees/blob/master/output_24_0.png?ref=jose.tapadas.dev" rel="noopener ugc nofollow">find here a 
graphical representation of the generated tree</a> @ Github.</p><p>We can also show a snippet of the predictions, and corresponding parameters, for a sample of the training data set. So for the following accommodations:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/42806c2aad28b910324fe82ce27cac1f" allowfullscreen frameborder="0" height="62" width="680" title="t12.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:540/1*FIBrv0qEzPKtNefEHEXPBw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="540" height="168"></figure><p>We obtain the following prices:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/55db8571380fd74e91c05d554cb8d614" allowfullscreen frameborder="0" height="62" width="680" title="t13.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([ 30., 81., 60., 30., 121.])</p><p>After fitting our model to the training data, we can now run a prediction for the validation set and measure the absolute error of our model, to assess how well it generalizes to data it was not trained on.</p><p>For this, we&#x2019;ll use the <strong>Mean Absolute Error</strong> (MAE) metric. We can consider this metric as the average error magnitude in a set of predictions. 
It can be represented as such:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:195/1*-DjPfVosxmxHUAlDE6hUOQ.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="195" height="69"></figure><p>It is basically an average over the differences between our model predictions ( <em>y-hat</em>) and the actual observations (y), considering that all individual differences have equal weight.</p><p>Let us then apply this metric to our model, using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Scikit Learn</a> implementation:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/d4f8fa3a8e38931bcb47fb8bc5a5393e" allowfullscreen frameborder="0" height="238" width="680" title="t14.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>42.91664212076583</p><p>This result basically means that our model is giving an absolute error of about <strong>42.935 EUR</strong> per accommodation when exposed to the test data, out of an <strong>88.38 EUR</strong> mean value that we collected during the initial data exploration.</p><p>Either due to our dataset being small or to our model being naive, this result is not satisfactory.</p><p>Even though this may seem worrying at this point, it is always advised to create a model that generates results as soon as possible and then start iterating on its optimization. Therefore, let us now proceed on attempting to improve our model&#x2019;s predictions a bit more.</p><p>Currently, we are indeed suffering from overfitting on the training data. 
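</p><p>To make the intuition concrete, here is a contrived pure-Python caricature of an unconstrained tree: a model that simply memorizes its training examples scores perfectly on them but falls back to a constant elsewhere (all numbers below are invented):</p><pre><code># a "maximally deep tree" caricature: memorize every training example
train = {(1, 2): 30.0, (2, 1): 81.0, (3, 2): 60.0}  # maps features to price
mean_price = sum(train.values()) / len(train)

def memorizing_predict(features):
    # perfect on training data, clueless elsewhere
    return train.get(features, mean_price)

train_mae = sum(abs(memorizing_predict(f) - y) for f, y in train.items()) / len(train)
print(train_mae)  # 0.0, which looks great...

validation = {(1, 3): 45.0, (2, 2): 95.0}
val_mae = sum(abs(memorizing_predict(f) - y) for f, y in validation.items()) / len(validation)
print(val_mae)  # ...but the error is large on unseen data</code></pre><p>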
If we imagine the decision tree that is being built, as we are not specifying a limit for the decisions to split, we will consequently generate a decision tree that grows very deep, fitting the training features too closely and not generalizing well to any unseen set.</p><p>As sklearn&#x2019;s <code>DecisionTreeRegressor</code> allows us to specify a maximum number of leaf nodes as a hyperparameter, let us quickly try to assess if there is a value that decreases our MAE:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/791ad983e10ee8c390e7c377845b1f78" allowfullscreen frameborder="0" height="678" width="680" title="t15.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>(Size: 5, MAE: 42.6016036138866) <br>(Size: 10, MAE: 40.951013502542885) <br>(Size: 20, MAE: 40.00407688450048) <br>(Size: 30, MAE: 39.6249335490541) <br>(Size: 50, MAE: 39.038730827750555) <br>(Size: 100, MAE: 37.72578309289501) <br>(Size: 250, MAE: 36.82474862034445) <br>(Size: 500, MAE: 37.58889602439078) <br>250</p><p>Let us then try to generate our model, but including the computed max tree size, and then check its predictions with the new limit:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/a2dbbe445fd31580d863a351b10c9968" allowfullscreen frameborder="0" height="238" width="680" title="t16.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>36.82474862034445</p><p>So by simply tuning our maximum number of leaf nodes hyper-parameter we could then obtain a significant improvement in our model&#x2019;s predictions. We have now reduced our model&apos;s average error by ( <code>42.935 - 36.825</code>) <strong>~ 6.11 EUR</strong>.</p><h1 id="4-categorical-data">4. 
Categorical Data</h1><p>As mentioned above, even though we have been able to keep optimizing our very simplistic model, we still dropped two possibly relevant fields that may (or may not) contribute to a better generalization and parameterization of our model: <code>room_type</code> and <code>neighborhood</code>.</p><p>These non-numerical data fields are usually referred to as <strong>Categorical Data</strong>, and most frequently we can approach them in three ways:</p><p><strong>1) Drop</strong></p><p>Sometimes the easiest way to deal with categorical data is&#x2026; to remove it from the dataset. We did this to set up our project quickly, but one must go case by case in order to infer the nature of such fields and whether it makes sense to drop them.</p><p>This was the scenario we analysed until now, with a MAE of: <strong>36.82474862034445</strong></p><p><strong>2) Label Encoding</strong></p><p>With label encoding, each value is assigned a unique integer. We can also make this transformation taking into account any kind of order/magnitude that may be relevant for the data (e.g., ratings, views, &#x2026;). 
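</p><p>Since the embedded snippets may not render in every feed reader, here is a minimal pure-Python sketch of the same idea (this mimics sklearn&#x2019;s <code>LabelEncoder</code>; the sample room types are chosen to reproduce the outputs shown below):</p><pre><code># assign each distinct category the index of its sorted position
def label_encode(values):
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[v] for v in values]

rooms = ['suite', 'suite', 'shared room', 'double room', 'single room']
classes, encoded = label_encode(rooms)
print(classes)  # ['double room', 'shared room', 'single room', 'suite']
print(encoded)  # [3, 3, 1, 0, 2]</code></pre><p>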
Let us check a simple example using the sklearn preprocessor:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/62b9898f7921a7dc4ca6f0288345310d" allowfullscreen frameborder="0" height="260" width="680" title="t17.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([3, 3, 1, 0, 2])</p><p>It is then trivial to see the transformation that the <code>LabelEncoder</code> is doing, by assigning the array index of the fitted data:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/fd2a7d16e7d0bdcd46f3b0370ba4fe16" allowfullscreen frameborder="0" height="62" width="680" title="t18.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([&apos;double room&apos;, &apos;shared room&apos;, &apos;single room&apos;, &apos;suite&apos;], dtype=&apos;&lt;U11&apos;)</p><p>Let us then apply this preprocessing technique to our categorical data and verify how it affects our model predictions. 
So our new data set would be:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/b4a1bcfc0de4077c4ecaaf9ffb415457" allowfullscreen frameborder="0" height="128" width="680" title="t19.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*qq6HfSEgbEeKAYqvTv1Vbw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="164"></figure><p>Our categorical data, represented in our pandas DataFrame as an <code>object</code>, can then be extracted by:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/344c1143394b96934b844551dd9f9fc6" allowfullscreen frameborder="0" height="128" width="680" title="t20.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>[&apos;room_type&apos;, &apos;neighborhood&apos;]</p><p>Now that we have the columns, let us then transform them in both the training and validation sets:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/bb59d7443ac9037e253855711063ac3a" allowfullscreen frameborder="0" height="370" width="680" title="t21.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*A76HrEgeWUUtNCURXJh9oA.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="168"></figure><p>Let us now train and fit the model with the transformed data:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/979aa7c706878c2004e55e791b298250" allowfullscreen frameborder="0" height="304" width="680" title="t23.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>35.690195084932355</p><p>We have then improved our predictor, by encoding our categorical data, reducing 
our MAE to <strong>~ 35.69 EUR</strong>.</p><p><strong>3) One-Hot Encoding</strong></p><p>One-Hot encoding, instead of enumerating a field&#x2019;s possible values, creates new columns indicating the presence or absence of the encoded values. Let us showcase this with a small example:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/f102fc1c018f1204aba7db6b22d19916" allowfullscreen frameborder="0" height="282" width="680" title="t24.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([[0., 0., 0., 1., 0., 0., 1.], [0., 1., 0., 0., 0., 1., 0.]])</p><p>From the result above we can see that the binary encoding places a <code>1</code> on the features that each feature array actually has, and a <code>0</code> when a feature is not present. Let us then try to use this preprocessing on our model:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9e684b1a800cf1b4a6daad843958e402" allowfullscreen frameborder="0" height="194" width="680" title="t25.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:628/1*_ek4d5Ox3NEDlfOJ1WXSog.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="628" height="198"></figure><p>So the above result may look weird at first but, for the 26 possible categories, we now have a binary column flagging the presence of each one. 
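</p><p>As before, in case the embedded snippet does not render, here is a minimal pure-Python sketch of the transformation for a single column (toy values; sklearn&#x2019;s <code>OneHotEncoder</code> additionally handles several columns and unknown categories):</p><pre><code># one 0/1 column per category, mimicking one-hot encoding of a single field
def one_hot(values):
    classes = sorted(set(values))
    return classes, [[1.0 if v == c else 0.0 for c in classes] for v in values]

classes, rows = one_hot(['suite', 'double room', 'suite'])
print(classes)  # ['double room', 'suite']
print(rows)  # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]</code></pre><p>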
We will now:</p><ul><li>add back the original row indexes that were lost during the transformation</li><li>drop the original categorical columns from the original sets <code>train_X</code> and <code>validation_X</code></li><li>replace the dropped columns by our new dataframe with all 26 possible categories</li></ul><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/2b440db9452cdbfbfe837fdffbd03bcd" allowfullscreen frameborder="0" height="282" width="680" title="t26.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*k9d83hBnIs5UU3HNZMHRaA.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="141"></figure><p>Now we can use our new encoded sets in our model:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/a220db2a01c272b60e448c0aba42dcac" allowfullscreen frameborder="0" height="304" width="680" title="t27.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>36.97010930367817</p><p>By using One-Hot Encoding on our categorical data we obtain a MAE of ~ <strong>36.97 EUR</strong>.</p><p>This result suggests that One-Hot Encoding, at least when applied to both of our categorical parameters at the same time, is not as good a fit here as Label Encoding. Nevertheless, it still allowed us to include the categorical parameters while reducing the initial MAE.</p><h1 id="5-random-forests">5. 
Random Forests</h1><p>From the previous section we could see that, with our Decision Tree, we are always balancing between:</p><ul><li>a deep tree with many leaves, in our case with few AirBnB places on each of them, being then too overfitted to our training set (they present what we call <strong>high variance</strong>)</li><li>a shallow tree with few leaves that is unable to distinguish between the various features of an item</li></ul><p>We can imagine a &#x201C;Random Forest&#x201D; as an ensemble of Decision Trees that, in order to try to reduce the variance mentioned above, generates its trees with added randomness so that combining them reduces the overall error. Some examples of how the random forest is created could be:</p><ul><li>generating trees with different subsets of the parameters. For example, from our set of parameters analysed above, trees would be generated having only a random set of them (e.g., a Decision Tree with only &#x201C;reviews&#x201D; and &#x201C;bedrooms&#x201D;, another with all parameters except &#x201C;latitude&#x201D;)</li><li>generating other trees by training on different samples of data (different sizes, different splits of the data set into training and validation, &#x2026;)</li></ul><p>In order to reduce the variance, the added randomness makes the generated individual trees&#x2019; errors less likely to be related. 
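</p><p>A toy simulation makes this variance argument concrete. The sketch below is purely illustrative: the noise level and the number of trees are invented, and 88.38 EUR is just our dataset&#x2019;s mean price used as a stand-in for a true value:</p><pre><code>import random
import statistics

random.seed(42)

# each "tree" makes an unbiased prediction with independent noise
def tree_predict(true_price):
    return true_price + random.gauss(0, 20)

# a "forest" averages many such noisy predictions
def forest_predict(true_price, n_trees=25):
    preds = [tree_predict(true_price) for _ in range(n_trees)]
    return sum(preds) / n_trees

single_errors = [tree_predict(88.38) - 88.38 for _ in range(2000)]
forest_errors = [forest_predict(88.38) - 88.38 for _ in range(2000)]

print(statistics.pstdev(single_errors))  # about 20
print(statistics.pstdev(forest_errors))  # about 4, i.e. 20 over sqrt(25)</code></pre><p>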
The final prediction is then taken as the average of the individual decision trees&#x2019; predictions; combining them has the interesting effect of canceling some of those errors out, further reducing the variance of the whole prediction.</p><p>The original publication, explaining this algorithm in more depth, can be found in the further reading section at the end of this article.</p><p>Let us then implement our predictor using a Random Forest:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9aa512fdc10a5a82bd624a5b4585d921" allowfullscreen frameborder="0" height="524" width="680" title="t28.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>33.9996500736377</p><p>We can see that we have a significant reduction in our MAE when using a Random Forest.</p><h1 id="6-summary">6. Summary</h1><p>Even though decision trees are a very simplistic (maybe the simplest) regression technique in machine learning that we can use in our models, we aimed to demonstrate a sample process of analysing a dataset in order to generate predictions. It was clear that with small optimization steps (like cleaning up the data and encoding categorical data) and generalizing from a single tree to a random forest we could significantly reduce the mean absolute error of our model predictions.</p><p>We hope that this example becomes useful as a hands-on experience with machine learning, and please don&#x2019;t hesitate to contact me if I can clarify or correct some of what was demonstrated above. The plan is also to proceed on further optimizing our predictions on this specific dataset in future articles, using other approaches and tools, so stay tuned :)</p><h1 id="7-further-reading">7. 
Further reading</h1><p>Please find below some resources that are very useful for understanding some of the concepts presented:</p><ul><li>StatQuest, Decision Trees: <a href="https://statquest.org/2018/01/22/statquest-decision-trees/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://statquest.org/2018/01/22/statquest-decision-trees/</a></li><li>Bias&#x2013;variance tradeoff: <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff</a></li><li>Breiman, Random Forests, Machine Learning, 45(1), 5&#x2013;32, 2001: <a href="https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Setting up a simple Rails development environment with Docker for fun and profit]]></title><description><![CDATA[<p>Creating a development environment may seem like a trivial task for many developers. 
As time progresses, and we find ourselves dwelling through the life cycle of so many projects, one probably ends up with a fragile and cluttered development machine, filled with an entropic set of unmanageable services and library</p>]]></description><link>https://jose.tapadas.dev/setting-up-a-simple-rails-development-environment-with-docker-for-fun-and-profit/</link><guid isPermaLink="false">65d7675cc7a18807246ff004</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Wed, 25 Jan 2017 00:00:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1523427373578-fa4bbfc4389a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fHJhaWxzfGVufDB8fHx8MTcwODYxNTUzMnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1523427373578-fa4bbfc4389a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fHJhaWxzfGVufDB8fHx8MTcwODYxNTUzMnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Setting up a simple Rails development environment with Docker for fun and profit"><p>Creating a development environment may seem like a trivial task for many developers. 
As time progresses, and we find ourselves dwelling through the life cycle of so many projects, one probably ends up with a fragile and cluttered development machine, filled with an entropic set of unmanageable services and library versions, ultimately getting to a point where things simply start to crack without any apparent reason.</p><p>With this small guide I hope to equip you with the set of tools and gears to create simple, manageable and isolated <em>production-like</em> development environments using Docker containers.</p><h1 id="the-plan">The Plan</h1><p>For this specific example we will create a fully contained <a href="http://rubyonrails.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Ruby on Rails</a> development environment alongside the isolated common services it usually communicates with, namely: a <a href="https://www.postgresql.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">PostgreSQL</a> database, <a href="http://sidekiq.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Sidekiq</a> (and <a href="https://redis.io/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Redis</a> to support it).</p><p>For creating and managing the isolated Linux containers we will use a combination of:</p><ul><li><a href="https://www.docker.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker</a>: the tool that will allow us to run lightweight isolated containers for our app</li><li><a href="https://www.docker.com/products/docker-compose?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Compose</a>: a tool that will help us manage multiple containers for the multiple services</li></ul><p>As this guide presents itself with a more simplistic practical approach, and does not go into the intricacies of Docker itself, please refer to the <a href="https://docs.docker.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker official documentation guides</a> if you feel the need to get a grasp of more specific information on any of 
the tools presented above.</p><h1 id="entering-docker">Entering Docker</h1><p>We will then start by creating a container to run our code using Docker.</p><p>Even though the concept may seem analogous to a Virtual Machine (VM), a container does not fully virtualize the whole hardware and OS stack as a standard VM does. The container will include the application and all of its dependencies, running its processes in an isolated userspace, but sharing the host kernel with other containers. These containers can be looked upon as lightweight VMs (<a href="https://blog.docker.com/2016/03/containers-are-not-vms/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">although they really aren&#x2019;t</a>) that provide a full virtual environment without the overhead that comes with booting up a separate kernel and simulating all the hardware.</p><p>To achieve this, Docker relies on both the Linux kernel and the <a href="https://en.wikipedia.org/wiki/LXC?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Linux Containers</a> (LXC) infrastructure.</p><h2 id="supervising-the-supervisor">Supervising the Supervisor</h2><p>Not all operating systems actually support isolated userspaces. Linux supports them, but OSX and Windows don&#x2019;t. So if you&#x2019;re on OSX or Windows, we will have to use a virtualization solution after all in order to boot a kernel that does support them.</p><p>Historically this was achieved by spinning a VM on VirtualBox with a <a href="http://boot2docker.io/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">tiny Linux distribution</a> on it to host the containers. 
Since <a href="https://blog.docker.com/2016/06/docker-mac-windows-public-beta/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">last June,</a> the Docker Team dropped VirtualBox leveraging both <a href="https://github.com/docker/HyperKit/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">HyperKit</a>, a lightweight macOS virtualization solution built on top of the native <a href="https://developer.apple.com/reference/hypervisor?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Hypervisor.framework</a> (introduced on macOS 10.10), and for Windows it now uses the <a href="https://technet.microsoft.com/en-us/library/mt169373(v=ws.11).aspx?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Microsoft Hyper-V</a> solution.</p><p>For this particular example we will be working on a macOS machine so let us start by downloading:</p><ul><li><a href="https://www.docker.com/products/docker?ref=jose.tapadas.dev#/mac" rel="noopener ugc nofollow">Docker for Mac</a> : <em>Docker for Mac is a native Mac application architected from scratch, with a native user interface and auto-update capability, deeply integrated with OS X native virtualization, Hypervisor Framework.</em></li></ul><h2 id="creating-a-new-image">Creating a new Image</h2><p>The way Docker images are managed is a bit like projects are managed via git on GitHub. There is a public collection of open source images at the <a href="https://hub.docker.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Hub</a> where we can <code>docker pull</code> existent images or <code>docker push </code>our contributions and custom configurations.</p><p>After having Docker up and running on our machine we can now start writing the recipe that will instruct it to build our image and the contained environment we intend to be working with. 
This is achieved by creating a <code>Dockerfile</code> on the root of our project:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/4e398db14d551fde11940bc9cac0bf63" allowfullscreen frameborder="0" height="546" width="680" title="Sample Dockerfile" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Let us then go through this simple configuration. The first line simply states that we will base this environment on a lightweight Ruby image from the <a href="https://hub.docker.com/_/ruby/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">official Ruby repository</a> named <code>ruby:2.3-slim</code> (you can actually check its own <a href="https://github.com/docker-library/ruby/blob/6e2934a351adb67fd95aed1a669ad54d758834a0/2.3/slim/Dockerfile?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><code>Dockerfile</code></a>):</p><pre><code>FROM ruby:2.3-slim</code></pre><p>Next we have a run list to install all the dependencies we need to have a basic development environment. 
You need this because the <code>ruby:2.3-slim</code> image is very minimal, and doesn&#x2019;t contain them out of the box:</p><pre><code>RUN apt-get update &amp;&amp; apt-get install -qq -y --no-install-recommends build-essential nodejs libpq-dev git tzdata libxml2-dev libxslt-dev ssh &amp;&amp; rm -rf /var/lib/apt/lists/*</code></pre><ul><li>the <a href="http://packages.ubuntu.com/precise/build-essential?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><em>build-essential</em></a> package to have the GNU C compilers, GNU C Library, Make and standard Debian package building tools</li><li><em>nodejs</em> as our choice for a JavaScript runtime for the asset pipeline</li><li><em>libpq-dev</em> is the programmer&#x2019;s interface to PostgreSQL</li><li><em>tzdata</em> as a dependency for the <a href="https://github.com/tzinfo/tzinfo?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Ruby Timezone Library</a></li><li><a href="http://xmlsoft.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><em>libxml2-dev</em></a> and <a href="http://xmlsoft.org/libxslt/?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><em>libxslt-dev</em></a> to build <a href="https://github.com/sparklemotion/nokogiri?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Nokogiri</a></li><li><em>ssh</em> and <em>git</em> as two essential tools for any sane developer</li></ul><p>Busting the apt cache and removing the contents of <code>/var/lib/apt/lists</code> helps us keep the image size down.</p><p>The next block sets our working directory in an environment variable and creates the folder that will accommodate our Rails app. 
The <code>WORKDIR</code> instruction basically sets the working directory for any <code>RUN</code>, <code>CMD</code>, <code>ENTRYPOINT</code>, <code>COPY</code> and <code>ADD</code> instructions that follow it in the <code>Dockerfile</code>.</p><pre><code>ENV APP_HOME /opt/fooapp
RUN mkdir -p $APP_HOME
WORKDIR $APP_HOME</code></pre><p>Finally, as we will be vendoring our gems with Bundler on our vendor path <code>/opt/fooapp/vendor/bundle</code>, we&#x2019;ll finish by setting the required environment variables:</p><pre><code>ENV GEM_HOME /opt/fooapp/vendor/bundle
ENV PATH $GEM_HOME/bin:$PATH
ENV BUNDLE_PATH $GEM_HOME
ENV BUNDLE_BIN $BUNDLE_PATH/bin</code></pre><p>Now that our recipe is complete, it is time to test building the image. This can be done by simply running:</p><pre><code>$ docker build -t samplefooappimage .</code></pre><p>Afterwards, by listing all the available images, we can confirm that it was in fact created:</p><pre><code>$ docker images
REPOSITORY        TAG       IMAGE ID       CREATED         SIZE
samplefooappimage latest    8915a11cb4c6   12 minutes ago  484.7 MB
ruby              2.3-slim  68e02bf2b853   7 days ago      273.8 MB</code></pre><h2 id="composing-the-services-stack">Composing the Services Stack</h2><p>Now that we have properly built our base image it is time to assemble and configure the services set that our application will be using. For this purpose we will be using <a href="https://docs.docker.com/compose/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Compose</a>, included on the installed Docker <em>toolbelt</em>. This tool will basically enable us to create and manage a multi-service, multi-container docker application.</p><p>We will start by creating an initial version of the configuration file with both our Rails app and PostgreSQL. 
On the root of our app, we edit a file named <code>docker-compose.yml</code>:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/deda5d07926812c0c580a4f522269aa0" allowfullscreen frameborder="0" height="458" width="680" title="Rails and PostgreSQL simple docker-compose.yml" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>The configuration file is pretty straightforward. We are setting up two service blocks. We will call the configuration that creates and runs PostgreSQL <code>database</code>, and our Rails app <code>web</code>. Please notice that Compose here will then create two separate Docker containers that will eventually communicate with each other, mimicking much more closely a real setup.</p><p>Let us now look a bit closer at each of the services&#x2019; configuration. For the <code>database</code> block:</p><ul><li>the <code>image</code> keyword simply specifies the image that will be used to start building the container. In this specific case we will be using the <a href="https://hub.docker.com/_/postgres/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">official PostgreSQL image</a>.</li><li>due to the volatile nature of Docker containers, we will need to persist the database data on our filesystem. To achieve that we are then specifying a mount point, from the host machine to the container, using the <code>volumes</code> keyword.</li><li>specify a filename containing a list of environment variables to be exported upon creation. For this example, the <code>env_file</code> will also be simple and contain the required PostgreSQL credentials info. 
We will then create it with some sample data:</li></ul><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/72ee5d3705230b6acdf61b6622b9ad97" allowfullscreen frameborder="0" height="84" width="680" title="Sample .env file" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>The Rails <code>web</code> service configuration is also pretty straightforward:</p><ul><li>create a link with the service container named <code>database</code>. Using this <code>links</code> keyword will also make the linked service reachable using a hostname identical to the service name, in this example: <code>database</code></li><li>specify the Dockerfile path to <code>build</code> the image</li><li>specify a mounting point for synchronising our app&#x2019;s code between our host machine and the container</li><li>expose, and map, the <code>3000</code> port on both the host and the container</li><li>specify the default <code>command</code> to run after the container is started</li><li>we will reuse the same <code>.env</code> file in order to have, on this container, the same configuration variables to simplify the database configuration</li></ul><p>Before we can start our two services we have to initialise the Rails app by installing the gems on the container and configuring the database. 
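</p><p>In case the embedded file does not render in your reader, the two service blocks described above amount to roughly the following <code>docker-compose.yml</code> (a sketch; the image tag, mount paths and compose file version are illustrative):</p><pre><code>version: '2'

services:
  database:
    image: postgres
    volumes:
      - ./tmp/db:/var/lib/postgresql/data
    env_file: .env

  web:
    build: .
    links:
      - database
    volumes:
      - .:/opt/fooapp
    ports:
      - "3000:3000"
    command: bundle exec puma
    env_file: .env</code></pre><p>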
This will be the first command we will run directly inside the <code>web</code> service container:</p><pre><code>$ docker-compose run --rm web bundle install</code></pre><p>After installing the dependencies we should update the application&#x2019;s <code>config/database.yml</code> with the database configuration information we&#x2019;ve created on the <code>docker-compose.yml</code> file:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/130c7bba0aa83b2d202e4d018ce3b8f5" allowfullscreen frameborder="0" height="282" width="680" title="Sample dockerized database.yml" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>On this plain configuration file I would like you to notice two things:</p><ul><li>the host is using the network alias we&#x2019;ve specified on the compose link configuration</li><li>we are using the environment variables exported from our <code>.env</code> file</li></ul><p>Now it is just a matter of creating the database inside the container:</p><pre><code>$ docker-compose run --rm web bundle exec rake db:create</code></pre><p>And start both services (the optional <code>-d</code> flag runs the containers in detached mode):</p><pre><code>$ docker-compose up -d</code></pre><p>You can check the state of your containers by running <code>docker-compose ps</code>, and eventually stop them by running <code>docker-compose stop</code>:</p><pre><code>$ docker-compose ps
Name                Command            State  Ports
------------------------------------------------------------------
fooapp_database_1   postgres           Up     5432/tcp
fooapp_web_1        bundle exec puma   Up     0.0.0.0:3000-&gt;3000/tcp</code></pre><p>If you are using OSX or Windows, the containers can also be managed by a tool called <a 
href="https://kitematic.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Kitematic</a> (part of the Docker Toolbelt) which, if it is installed, will also be available under the Docker icon on the macOS&#x2019; menu bar:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*d4DtkTl-t2cass7A64W5Kg.png" class="kg-image" alt="Setting up a simple Rails development environment with Docker for fun and profit" loading="lazy" width="700" height="444"><figcaption><span style="white-space: pre-wrap;">Kitematic showing both running containers</span></figcaption></figure><p>As puma is bound to the default interface you can now easily access your app, outside the container, by navigating to <a href="http://0.0.0.0:3000/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">http://127.0.0.1:3000</a> as expected.</p><h2 id="adding-sidekiq-to-the-mix">Adding Sidekiq to the Mix</h2><p>Let us now build up on this preliminary stack by spinning a new container with our codebase but for running Sidekiq. 
To add new services we simply re-use the same simple configurations in our composition file.</p><p>As Sidekiq requires <em>Redis</em> to work, we add it by editing our <code>docker-compose.yml</code> file:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/c7ce561683e9adfcfb846c9aa32f7627" allowfullscreen frameborder="0" height="634" width="680" title="Adding redis to the composition" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>As you can see, it is pretty straightforward:</p><ul><li>we use an existing official <em>Redis</em> image</li><li>we specify the port forwarding</li><li>we specify a mount point to persist the data on the host machine</li><li>and we also link this new service to our Rails app, just in case</li></ul><p>After running <code>docker-compose up</code> you will see the new container being built and started.</p><p>Finally, we simply add a new container specifically for <em>Sidekiq</em>:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/f232ac7006b7f7bfc238c5e8a2a3ac4b" allowfullscreen frameborder="0" height="854" width="680" title="docker-compose file with Sidekiq" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>As you can see from the updated composition file, we now have a <code>sidekiq</code> service, linked to both our database and redis (with a host alias to avoid confusion). 
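</p><p>Schematically, and again only as a hedged sketch rather than the literal gist contents, the two new services could look something like this (the <code>redis</code> alias, image and paths are assumptions):</p><pre><code>  redis:
    # official Redis image; the official image keeps its data under /data
    image: redis
    ports:
      - &quot;6379:6379&quot;
    volumes:
      - ./tmp/redis:/data

  sidekiq:
    # same codebase as the web service, but running the Sidekiq process
    build: .
    command: bundle exec sidekiq
    env_file: .env
    volumes:
      - .:/opt/fooapp
    links:
      - database
      - redis:redis
</code></pre><p>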
We are also sharing our <code>.env</code> file, which we need to update to reference our <code>REDIS_URL</code> so Sidekiq knows how to connect to it:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/ebc68a4d1d3703af3b5d06bd90a1fe73" allowfullscreen frameborder="0" height="106" width="680" title="Environment files with redis now" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Then we simply install the new app dependencies directly in the container, and restart our services:</p><pre><code>$ docker-compose run --rm sidekiq bundle install
(...)
$ docker-compose restart</code></pre><p>We will now see all services, including the new Sidekiq one, up and running:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*67Oh73QZINlrDIg1dj3s-Q.png" class="kg-image" alt="Setting up a simple Rails development environment with Docker for fun and profit" loading="lazy" width="700" height="310"></figure><h2 id="debugging-tip-with-pry">Debugging tip with pry</h2><p>To be able to use a tool like pry on this stack we need a way to attach to a running container and write to its stdin. 
To do so we add the <code>tty</code> and <code>stdin_open</code> options to the compose configuration accordingly:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/7f0eca3902937595982ba82837cc37dd" allowfullscreen frameborder="0" height="942" width="680" title="tty open" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Enabling these will allow us to attach to any running container and interactively engage with its current running process, in this scenario with a tool like pry.</p><p>After adding a simple <code>binding.pry</code> breakpoint to our app, we can reach it by first identifying the ID of the running container we want to attach to:</p><pre><code>$ docker ps
CONTAINER ID   IMAGE        COMMAND              CREATED   STATUS   PORTS                    NAMES
(...)
7fba610d5c62   fooapp_web   &quot;bundle exec puma&quot;                      0.0.0.0:3000-&gt;3000/tcp   fooapp_web_1
(...)</code></pre><p>And then attaching to it:</p><pre><code>$ docker attach 7fba610d5c62

From: /opt/fooapp/app/views/static_pages/root.html.erb @ line 5 ActionView::CompiledTemplates#_app_views_static_pages_root_html_erb___2205918372626965601_36280540:

    1: A new website.
    2:
    3: This is not a new feature :)
    4:
 =&gt; 5: &lt;% binding.pry %&gt;

[1] pry(#&lt;#&lt;Class:0x00000004531cf0&gt;&gt;)&gt;</code></pre><p>From here you can interact with the breakpoint as you would in your usual development workflow.</p><h1 id="wrap-up">Wrap up</h1><p>Even though this is a very minimalistic setup, it comprises a contained, multi-service application foundation that can serve as the basis for many of the projects you may end up working on. 
This setup also has the bonus of actually mimicking the machine structure of a multi-service application.</p><p>The simple approach of using these configuration files will not only keep your development workflow sane if you are working on multiple projects with a diverse ecosystem of dependencies, but will also dramatically ease the learning curve for newcomers to virtually any project you have set up.</p><p>Please feel free to reference and contribute to the <a href="https://gist.github.com/josetapadas/044fda80719f218c1a40a0b7492bdd3e?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Dockerfile</a> and <a href="https://gist.github.com/josetapadas/a94dc057f26e442c9d17f20721b09dbd?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Compose</a> configuration files used in this example.</p><h2 id="further-reading">Further reading</h2><ul><li><a href="https://docs.docker.com/engine/understanding-docker/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://docs.docker.com/engine/understanding-docker/</a>: <em>Understanding Docker</em>, from the official documentation</li><li><a href="https://docs.docker.com/compose/reference/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://docs.docker.com/compose/reference/</a>: Docker Compose command reference</li><li><a href="http://docker-sync.io/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">http://docker-sync.io/</a>: gives macOS users a performance boost by using either <strong>rsync</strong> or <strong>unison</strong> to sync volumes between the host machine and the containers</li></ul>]]></content:encoded></item></channel></rss>