<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[josé tapadas alves]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://jose.tapadas.dev/</link><image><url>https://jose.tapadas.dev/favicon.png</url><title>josé tapadas alves</title><link>https://jose.tapadas.dev/</link></image><generator>Ghost 5.79</generator><lastBuildDate>Wed, 15 Apr 2026 20:16:35 GMT</lastBuildDate><atom:link href="https://jose.tapadas.dev/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[How to sign PDFs in (Ubuntu) Linux]]></title><description><![CDATA[<p>So I had to sign a PDF on my Linux box; here is a quick guide on how to do it. </p><ol><li>Install an open-source PDF editor; I opted for <a href="https://okular.kde.org/pt-pt/?ref=jose.tapadas.dev">Okular</a>.</li><li>Generate both a private key (<code>pdf_signing.key</code>) and a self-signed certificate (<code>pdf_signing.crt</code>) to use</li></ol>]]></description><link>https://jose.tapadas.dev/how-to-sign-pdfs-in-ubuntu-linux/</link><guid isPermaLink="false">6895c4c378c9d047a15b9546</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Fri, 08 Aug 2025 09:46:56 GMT</pubDate><content:encoded><![CDATA[<p>So I had to sign a PDF on my Linux box; here is a quick guide on how to do it. 
</p><ol><li>Install an open-source PDF editor; I opted for <a href="https://okular.kde.org/pt-pt/?ref=jose.tapadas.dev">Okular</a>.</li><li>Generate both a private key (<code>pdf_signing.key</code>) and a self-signed certificate (<code>pdf_signing.crt</code>) to use for signing the documents, using <code>openssl</code>:</li></ol><pre><code>openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes -keyout pdf_signing.key -out pdf_signing.crt -subj &quot;/CN=Jose Alves/emailAddress=jose@alves.lol&quot;</code></pre><ol start="3"><li>Convert it to PKCS#12:</li></ol><pre><code>openssl pkcs12 -export -in pdf_signing.crt -inkey pdf_signing.key -out signing-certificate.p12 -name &quot;Jose Alves&quot;</code></pre><ol start="4"><li>Import it into the NSS database (where Okular will look for it):</li></ol><pre><code>pk12util -d ~/.pki/nssdb -i signing-certificate.p12</code></pre><ol start="5"><li>Open Okular and choose <code>Tools &gt; Digitally Sign...</code></li></ol>]]></content:encoded></item><item><title><![CDATA[Creating a git repo from scratch: plumber style]]></title><description><![CDATA[Have you ever wanted to create a new git repo from scratch using git's low-level plumbing commands? 
Of course not, and that is why this small article will showcase how Mario would do it.]]></description><link>https://jose.tapadas.dev/creating-a-git-repo-from-scratch-plumber-style/</link><guid isPermaLink="false">65f6f14345341204b41b8723</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Sun, 17 Mar 2024 14:51:32 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1615465502839-71d5974f5087?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE2fHxzdXBlciUyMG1hcmlvfGVufDB8fHx8MTcxMDY4MjM5OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1615465502839-71d5974f5087?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE2fHxzdXBlciUyMG1hcmlvfGVufDB8fHx8MTcxMDY4MjM5OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Creating a git repo from scratch: plumber style"><p>So in a <a href="https://jose.tapadas.dev/git-object-basics-1/" rel="noreferrer">previous article we delved into the basics of git internals</a> and concluded, in a succinct way, that we have the following main entities:</p><ul><li><strong>blob: </strong>a binary representation of the contents of a file</li><li><strong>tree</strong>: a representation of a directory listing consisting of <code>blobs</code> and other <code>trees</code></li><li><strong>commit:</strong> a snapshot of the working tree</li></ul><p>And in this new entry we will look at how to create a repository, a commit and a branch, from scratch, without using any of the fancy and user-friendly commands like <code>git init</code>, <code>git add</code>, <code>git commit</code>, etc... 
just because we can, and because it is Sunday and the kid is asleep.</p><p>As a small remark, the failed pun regarding plumbers is actually a git concept that distinguishes between what are referred to as the <strong>porcelain</strong> commands (the user-friendly commands that interface with git internals) and the <strong>plumbing </strong>commands (the internal low-level commands). This is of course an even sadder pun on the actual work around a toilet: we usually only look at the ceramic porcelain, but when we need to deeply fix something (omg), we have to go into the plumbing. For more context one may refer to: <a href="https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain?ref=jose.tapadas.dev">https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain</a></p><h3 id="so-what-is-a-repo-anyway">So what is a repo anyway?</h3><p>We can look at a usual project managed by git as containing the following three entities:</p><ul><li>a <strong>working directory</strong> on our filesystem with its inherent directory and file structure</li><li>our <strong>repository</strong>, which is basically a set of <strong>commit</strong> objects that hold the representation of our <strong>working directory</strong> at certain points in time</li><li>our <strong>index </strong>(commonly referred to as the &quot;staging area&quot;), which is the materialization of the actual changed files at a specific point in time. 
A bit like our &quot;database of binaries&quot;.</li></ul><blockquote class="kg-blockquote-alt"><strong>A git repository is then a collection of objects and a system to name and reference those objects, usually referred to as refs.</strong></blockquote><p>We intend, without the benefits of our user-friendly common commands, to create a repo from scratch, using only plumbing commands, in order to grasp a bit more of the internals of git.</p><h3 id="letsa-go-build-the-structure">Let&apos;sa go build the structure</h3><p>Let us then go down the pipe and create a new directory to hold our new repository, then ask git what it thinks about it:</p><pre><code class="language-sh">$ mkdir gitrepo
$ cd gitrepo/
$ git status
fatal: not a git repository (or any of the parent directories): .git</code></pre><p>So we have created our folder, but git does not see it as a repository (as expected). This is where we would normally use <code>git init</code>, but that is out of reach for us today. As we mentioned before, a git repo is basically a set of commit objects and a set of references to them. You may have already noticed that a git repo contains a very specific folder named <code>.git</code>, so let us now create the structure we&apos;re talking about:</p><pre><code>$ mkdir .git
$ mkdir .git/objects
$ mkdir -p .git/refs/heads
$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; objects
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

4 directories, 0 files</code></pre><p>So we now have the usual structure: a folder to hold our git objects and a set of references. The <code>heads</code> folder will hold a special kind of reference that we usually refer to as <code>branches</code>. Based on this, can you already infer the nature of a <strong>branch</strong> in this mix? A <strong>branch</strong> is nothing more than a named reference to a specific commit.</p><p>Asking git about the state now that we have created the base structure:</p><pre><code>$ git status
fatal: not a git repository (or any of the parent directories): .git</code></pre><p>Nope, it is still not happy. This is because git does not know what to look for. By default it looks for a commit, any commit (which we don&apos;t have), via a base pointer, a reference named <code>HEAD</code>. And what is <code>HEAD</code>? Just a simple file that points to a specific reference. Let us now create it:</p><pre><code>$ echo &quot;ref: refs/heads/master&quot; &gt; .git/HEAD
$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; objects
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads</code></pre><p>So now we have created our <code>HEAD</code> reference, stating where it points to, and we can verify what git thinks about it:</p><pre><code>$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use &quot;git add&quot; to track)</code></pre><p>Yup: git, looking into <code>HEAD</code>, now knows where we are. Still, if we try to check our git log:</p><pre><code>$ git log
fatal: your current branch &apos;master&apos; does not have any commits yet</code></pre><p>As expected, it just gives a fatal error, as we lack any kind of commit. As we mentioned before, refs (branches included) are just named references to git objects called <code>commits</code>, and a <code>commit</code> points to a <code>tree</code> of <code>blobs</code>, so let us try to write a new binary <code>blob</code> into our object database:</p><pre><code>$ echo &quot;Mario&quot; | git hash-object --stdin -w
5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd</code></pre><p>We use <code>git-hash-object</code> (just <code>man</code> it for more), taking the content from <code>stdin</code>, and write (<code>-w</code>) it to our <code>objects</code> store:</p><pre><code>$ tree .git
.git
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads</code></pre><p>As we can see, the binary blob is now added to our <code>.git/objects</code> folder. The subfolder named after the first byte <code>5e</code> is just an optimization git uses to look up files based on the first byte of their SHA-1 hash value. We can verify the file by using the <code>git-cat-file</code> command on the returned hash:</p><pre><code>$ git cat-file -t 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
blob
$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd
Mario
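# aside (not part of the original session): a blob id is just the sha1 of the
# string "blob ", the content size, a NUL byte and the contents, so we can
# reproduce the id above without git at all
$ printf 'blob 6\0Mario\n' | sha1sum
5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd  -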
</code></pre><p>So we can verify that this specific hashed object is in fact a <code>blob</code>, and we can even see its contents. Let us check the status:</p><pre><code>$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use &quot;git add&quot; to track)</code></pre><p>Yup, nothing there: no changes, no commits. This is because the blob was not added to the &quot;binary database&quot; of tracked changes that we called the <strong>index</strong> above. Let us then use the command <code>git-update-index</code> to update the index with this new <code>blob</code>:</p><pre><code>$ git update-index --add --cacheinfo 100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd itsame.txt

$ tree .git
.git
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; index
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

5 directories, 3 files
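# aside (not part of the original session): ls-files --stage is the plumbing
# way to dump what the index file now records
$ git ls-files --stage
100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd 0	itsame.txt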
</code></pre><p>We can see that the command has now created a new file named <code>.git/index</code> (our staging area). In case you are wondering about that magic number <code>100644</code>, it simply marks this blob as a normal file. If you&apos;re interested, <a href="https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects?ref=jose.tapadas.dev" rel="noreferrer">the docs about objects</a> state that: </p><blockquote>In this case, you&#x2019;re specifying a mode of&#xA0;<code>100644</code>, which means it&#x2019;s a normal file. Other options are&#xA0;<code>100755</code>, which means it&#x2019;s an executable file; and&#xA0;<code>120000</code>, which specifies a symbolic link.&#xA0;</blockquote><p>Let us check our status:</p><pre><code>$ git status
On branch master

No commits yet

Changes to be committed:
  (use &quot;git rm --cached &lt;file&gt;...&quot; to unstage)
	new file:   itsame.txt

Changes not staged for commit:
  (use &quot;git add/rm &lt;file&gt;...&quot; to update what will be committed)
  (use &quot;git restore &lt;file&gt;...&quot; to discard changes in working directory)
  deleted:    itsame.txt</code></pre><p>Now this is quite an interesting output:</p><ul><li>we see our <code>itsame.txt</code> file added to our <strong>index</strong> (which, before this article, we used to call the &quot;staging area&quot;)</li><li>we have the same file being marked as <strong>deleted</strong></li></ul><p>This is because our binary data is stored in the <strong>index</strong> but not on the actual filesystem, so git assumes the file was deleted at this point in time. Let us then create the file, with the same content as the blob, using the <code>git-cat-file</code> command:</p><pre><code>$ git cat-file -p 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd &gt; itsame.txt
$ git status
On branch master

No commits yet

Changes to be committed:
  (use &quot;git rm --cached &lt;file&gt;...&quot; to unstage)
	new file:   itsame.txt</code></pre><p>And now we have the proper change being tracked as something to be committed.</p><h3 id="building-the-commit">Building the commit</h3><p>At this point we can create our snapshot of the contents of our project by creating a <strong>commit</strong> object. We won&apos;t, of course, be using the porcelain <code>git-commit</code> command.</p><p>As we mentioned above, a <strong>commit</strong> is an object that points to a <strong>tree</strong>, representing our project at a specific moment in time. Let us then start by writing a <code>tree</code> object with our current state using the command <code>git-write-tree</code>:</p><pre><code>$ git write-tree
88e009ef712773b75ffdfba27ea1a87858f16a4a
jose@syn:~/src/gitrepo$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; index
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 88
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; e009ef712773b75ffdfba27ea1a87858f16a4a
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

6 directories, 4 files</code></pre><p>We can see the object is created and added to our <code>.git/objects</code> structure. We can also verify what it is:</p><pre><code>$ git cat-file -t 88e009ef712773b75ffdfba27ea1a87858f16a4a
tree
$ git cat-file -p 88e009ef712773b75ffdfba27ea1a87858f16a4a
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd	itsame.txt</code></pre><p>From here we see that it is in fact... a <code>tree</code>, and it contains the type (regular file blob), the hash and the name of the file that is present in our project. Again, please be reminded that a <code>tree</code> is nothing more than an object representing our working directory by referencing both <code>blobs</code> for files and other <code>trees</code> for directories (please refer to the <a href="https://jose.tapadas.dev/git-object-basics-1/" rel="noreferrer">previous article about git object basics </a>for a refresh if this is not clear by now).</p><p>Now that we have a <strong>tree</strong> we can proceed to create a <strong>commit</strong> object that points to it, using the command <code>git-commit-tree</code>:</p><pre><code>$ git commit-tree 88e009ef712773b75ffdfba27ea1a87858f16a4a -m &quot;initial commit&quot;
f81a29be9727635c552d3c16fc541fed104b3e24

$ tree .git/
.git/
&#x251C;&#x2500;&#x2500; config
&#x251C;&#x2500;&#x2500; HEAD
&#x251C;&#x2500;&#x2500; index
&#x251C;&#x2500;&#x2500; objects
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 5e
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 6d7d2182cc8a7a0ab90cfceab2f628c52596dd
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 88
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; e009ef712773b75ffdfba27ea1a87858f16a4a
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; f8
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; 1a29be9727635c552d3c16fc541fed104b3e24
&#x2514;&#x2500;&#x2500; refs
    &#x2514;&#x2500;&#x2500; heads

7 directories, 6 files
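# aside (hypothetical check, not in the original session): count-objects
# tallies the loose objects stored so far; it should report "3 objects" here
# (the blob, the tree and this commit), the kilobyte count varies by filesystem
$ git count-objects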
</code></pre><p>We can see that the commit object is created and added to our <code>.git/objects</code> directory, like any other git object. Checking the content of this object, for the more curious or inquisitive:</p><pre><code>$ git cat-file -t f81a29be9727635c552d3c16fc541fed104b3e24
commit
$ git cat-file -p f81a29be9727635c552d3c16fc541fed104b3e24
tree 88e009ef712773b75ffdfba27ea1a87858f16a4a
author jose &lt;jose@tapadas.dev&gt; 1710685680 +0000
committer jose &lt;jose@tapadas.dev&gt; 1710685680 +0000

initial commit</code></pre><p>We can see there that the type of the object is in fact a <code>commit</code>; we can see the <code>tree</code> that it refers to, information about the committer, and the message. Checking our status:</p><pre><code>$ git status
On branch master

No commits yet

Changes to be committed:
  (use &quot;git rm --cached &lt;file&gt;...&quot; to unstage)
	new file:   itsame.txt</code></pre><p>Still nothing happened. This is because git has no idea where the commit is. It does know that <code>HEAD</code> points to some ref representing our <code>master</code> branch, but it cannot make the connection between that ref and the actual commit object we have created. To do so, let us add the commit hash to our <code>master</code> branch reference:</p><pre><code>$ echo f81a29be9727635c552d3c16fc541fed104b3e24 &gt; .git/refs/heads/master
$ git status
On branch master
nothing to commit, working tree clean</code></pre><p>This is what lets git know that the state of that commit is exactly where our <code>HEAD</code> is pointing. Looking into our log:</p><pre><code>$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -&gt; master)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:28:00 2024 +0000

    initial commit</code></pre><p>We have then successfully created our base repository.</p><h3 id="shall-we-branch-it">shall we branch it?</h3><p>So we have seen that a branch is nothing more than another reference to a commit, within <code>.git/refs/heads</code>. Let us then create a new branch without using the <code>git-branch</code> command:</p><pre><code>$ cat .git/refs/heads/master &gt; .git/refs/heads/feature
$ git log
commit f81a29be9727635c552d3c16fc541fed104b3e24 (HEAD -&gt; master, feature)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:28:00 2024 +0000

    initial commit
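# aside (not part of the original session): both branch refs are now just
# small files holding the very same commit hash
$ cat .git/refs/heads/master .git/refs/heads/feature
f81a29be9727635c552d3c16fc541fed104b3e24
f81a29be9727635c552d3c16fc541fed104b3e24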
</code></pre><p>We can see from our log that <code>HEAD</code> is pointing to <code>master</code>, and that <code>feature</code> is at the same point in history. So <em>checking out </em>the <code>feature</code> branch should then be as easy as updating our <code>HEAD</code> base reference:</p><pre><code>$ echo &quot;ref: refs/heads/feature&quot; &gt; .git/HEAD
$ git status
On branch feature
nothing to commit, working tree clean</code></pre><p>As we can see we are now on that branch, as <code>HEAD</code> points to that reference.</p><p>Let us now create a file the same way we did before in order to observe how the divergence of paths is represented using this structure:</p><pre><code>$ echo &quot;Luigi&quot; | git hash-object --stdin -w
cac2640e641943f5b71bd076589dce53923ad7bd

$ git cat-file -p cac2640e641943f5b71bd076589dce53923ad7bd &gt; sidekick.txt</code></pre><p>We can then add it to our index:</p><pre><code>$ git update-index --add --cacheinfo 100644 cac2640e641943f5b71bd076589dce53923ad7bd sidekick.txt
$ git status
On branch feature
Changes to be committed:
  (use &quot;git restore --staged &lt;file&gt;...&quot; to unstage)
	new file:   sidekick.txt
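# aside (not part of the original session): the index now records both blobs
$ git ls-files --stage
100644 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd 0	itsame.txt
100644 cac2640e641943f5b71bd076589dce53923ad7bd 0	sidekick.txt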
</code></pre><p>We are now able to trivially create a commit for this feature by first creating a <code>tree</code>:</p><pre><code>$ git write-tree
4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
$ git cat-file -p 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17
100644 blob 5e6d7d2182cc8a7a0ab90cfceab2f628c52596dd	itsame.txt
100644 blob cac2640e641943f5b71bd076589dce53923ad7bd	sidekick.txt</code></pre><p>Above we can see that we now have two <code>blobs</code> in this snapshot in time, so we just commit it, but now specifying our initial commit as the (<code>-p</code>) <strong>parent</strong> commit:</p><pre><code>$ git commit-tree 4b5762a3a26e7d8ca1ff0794c3e2765a78ce2c17 -m &quot;adding a friend for mario&quot; -p f81a29be9727635c552d3c16fc541fed104b3e24
6c06460452f79dae525c00b7deb838eb5987d5dc</code></pre><p>Looking at <code>git-log</code> won&apos;t do us any good if we don&apos;t update our <strong>branch</strong> ref to point to this new commit:</p><pre><code>$ echo 6c06460452f79dae525c00b7deb838eb5987d5dc &gt; .git/refs/heads/feature</code></pre><p>And if we now look at the log we can see that we have properly created a divergence that is being tracked correctly by git:</p><pre><code>jose@syn:~/src/gitrepo$ git log
commit 6c06460452f79dae525c00b7deb838eb5987d5dc (HEAD -&gt; feature)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:46:39 2024 +0000

    adding a friend for mario

commit f81a29be9727635c552d3c16fc541fed104b3e24 (master)
Author: jose &lt;jose@tapadas.dev&gt;
Date:   Sun Mar 17 14:28:00 2024 +0000</code></pre><p>We plumbed our way into the actual inner management that git does to track its objects via references.</p><h3 id="wrap-up">wrap up</h3><p>In this small article we explored and expanded our initial experience with git objects, verifying how a git repository is nothing more than a collection of objects and a system of referring to them via refs.</p>]]></content:encoded></item><item><title><![CDATA[Git object basics I]]></title><description><![CDATA[Let us think about Git in a simplified manner, namely looking at it as if we're just maintaining a filesystem state. ]]></description><link>https://jose.tapadas.dev/git-object-basics-1/</link><guid isPermaLink="false">65ea3edc45341204b41b86f9</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Thu, 07 Mar 2024 22:35:25 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1516918842892-1c43ea4ad867?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHNuYXBzaG90fGVufDB8fHx8MTcwOTg1MDg5Nnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1516918842892-1c43ea4ad867?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHNuYXBzaG90fGVufDB8fHx8MTcwOTg1MDg5Nnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Git object basics I"><p>Let us think about Git in a simplified manner, namely looking at it as if we&apos;re just maintaining a filesystem state. 
Let us consider that we have a simple directory to track using git:</p><figure class="kg-card kg-image-card"><img src="https://jose.tapadas.dev/content/images/2024/03/image.png" class="kg-image" alt="Git object basics I" loading="lazy" width="628" height="406" srcset="https://jose.tapadas.dev/content/images/size/w600/2024/03/image.png 600w, https://jose.tapadas.dev/content/images/2024/03/image.png 628w"></figure><p>In terms of internal git objects we&apos;ll then expose three types that we can use to represent this structure:</p><ul><li><strong>BLOB</strong>: a binary representation of the contents of a file</li><li><strong>TREE</strong>: a representation of a directory listing of blobs and other trees</li><li><strong>COMMIT</strong>: a snapshot of the working tree</li></ul><p>In the following sections we&apos;ll briefly explore what these entities are.</p><h3 id="trees-%F0%9F%8C%B2">trees &#x1F332;</h3><p>A Git tree object represents a directory. It essentially functions as a description of the file system hierarchy at a particular moment in time. A tree object can reference other trees (subdirectories) and blobs (files), mimicking the structure of a file system directory. Tree objects are identified by their SHA-1 hash value, which is calculated based on:</p><ul><li>the contents of the tree object</li><li>the names, permissions, ... and subsequent SHA-1 hashed identifiers of the included blobs and other trees.</li></ul><p>For our repo we can check the contents of the tree as follows:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>And to construct the actual full structure of our repository, we can also check the <code>tree</code> of our <code>docs</code> directory by using the following syntax on the same reference:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD:docs
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	me.jpeg
</code></pre><p>But the same can be obtained by providing the <code>tree</code>&apos;s identifying SHA-1 hash:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree bb6e0e61969fafcc94435ace3fa644237667bacb
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	me.jpeg
</code></pre><p>With the above object representation we can then reconstruct the abstracted directory listing of both <code>trees</code> and <code>blobs</code> from our repo.</p><h3 id="blobs-%F0%9F%A9%BB">blobs &#x1FA7B;</h3><p>A git blob (binary large object) is a fundamental data structure in git. It stores the contents of a file in a repository at a given time, and once created it is immutable. Unlike a file in a filesystem, a blob does not contain any kind of metadata relevant to git, and it is identified by its own SHA-1 hash.</p><p>As we saw above, we can obtain the blob SHA-1 hashes of the files by recursing through their definitions in the <code>tree</code> objects. Let us obtain the list of blobs in our repo:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree -r -l HEAD
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239      16	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346 1713408	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c      26	readme.md
</code></pre><p>So in order for us to actually check the binary representation of a file, as git abstracts it, we can then:</p><pre><code>jose@syn:~/src/gitint$ git cat-file -p fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c | hexdump -C
00000000  57 65 6c 63 6f 6d 65 20  74 6f 20 6d 79 20 73 61  |Welcome to my sa|
00000010  6d 70 6c 65 20 72 65 70  6f 0a                    |mple repo.|
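# aside (not part of the original session): hashing the working-tree file again
# reproduces the exact same blob id, since the id depends only on the contents
jose@syn:~/src/gitint$ git hash-object readme.md
fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c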
</code></pre><h3 id="commit-%F0%9F%93%B8">commit &#x1F4F8;</h3><p>A git commit is an object that represents a <em>snapshot</em> of the full working tree at a given point in time, along with its content. It contains two main sections:</p><ul><li>a pointer to the main <code>tree</code>, or our &quot;root directory&quot;</li><li>a metadata section that includes things like the commit author&apos;s name, the commit time, the commit message and usually references to one or more parent <code>commit</code> objects (or snapshots) of our structure at a given time.</li></ul><p>As expected by now, the <code>commit</code> object is also identified by its SHA-1 hash. This is usually what we see when we check the <code>git log</code>. One must note that a git commit includes the <strong>entire</strong> snapshot of the objects at that point in time, not only the diffs introduced by the committer.</p><p>Above we mentioned that <code>blobs</code> are immutable. This means that, upon a change to a certain file, and when making a new commit, only the modified file will get a different SHA-1 hash. For the ones kept the same (as the hash is not changing), nothing changes within the tree references. This is how, within a commit, git optimizes the materialized changes on the filesystem: only updating what in fact changed and keeping the blobs from parent commits that are not modified (whose SHA-1 hash is the same).</p><p>Let us recall the blob info of the <code>docs/bio.txt</code> file:</p><pre><code>jose@syn:~/src/gitint$ git cat-file -p 837c50063e6a1a55d8862632cb122da0d42b8239 | hexdump -C
00000000  48 45 4c 4c 4f 20 49 20  61 6d 20 6a 6f 73 65 0a  |HELLO I am jose.|
</code></pre><p>Our root <code>tree</code>:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree bb6e0e61969fafcc94435ace3fa644237667bacb	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>And our blobs within (recursively from the root):</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 837c50063e6a1a55d8862632cb122da0d42b8239	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>We&apos;ll now modify the file and commit the changes:</p><pre><code>jose@syn:~/src/gitint$ echo &quot;My spoon is too big&quot; &gt;&gt; docs/bio.txt 
jose@syn:~/src/gitint$ git commit -am &quot;new bio version&quot;
[main d1f46f8] new bio version
 1 file changed, 1 insertion(+)
</code></pre><p>Now listing our <code>tree</code> we get a different hash:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD
040000 tree af1ef4ba103d5da65db58bedae6ffa151faa6017	docs
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
</code></pre><p>So the updated <code>af1ef4ba103d5da65db58bedae6ffa151faa6017</code> is a different computed value from the one we got before the commit (<code>bb6e0e61969fafcc94435ace3fa644237667bacb</code>). This is of course because our blob for that specific file also changed:</p><pre><code>jose@syn:~/src/gitint$ git ls-tree HEAD -r
100644 blob 7c06157b32c6f934709ae6a95ec46118cf5c0f7d	docs/bio.txt
100644 blob 7a0af0697e98691700c3bbe050f6910ca6c71346	docs/me.jpeg
100644 blob fdfa549e7693b012ad2fd1d97cc8af2c524f7c1c	readme.md
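# aside (hypothetical check, not in the original session): diff-tree can
# compare the two docs trees directly and confirms only bio.txt got a new blob
jose@syn:~/src/gitint$ git diff-tree bb6e0e61969fafcc94435ace3fa644237667bacb af1ef4ba103d5da65db58bedae6ffa151faa6017
:100644 100644 837c50063e6a1a55d8862632cb122da0d42b8239 7c06157b32c6f934709ae6a95ec46118cf5c0f7d M	bio.txt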
</code></pre><p>But if we compare with the blob hashes before the commit, we verify that only the <code>docs/bio.txt</code> file has changed. This is how git optimizes file system storage: only updating, within each snapshot (commit), the objects that changed, and keeping the pointers the same for what is not touched.</p><p>Besides information about the blobs, the metadata of a <code>commit</code> also records the parent commit(s) it was based on. This is present on every commit. Let us consider our current git log:</p><pre><code>jose@syn:~/src/gitint$ git log
commit d1f46f80cea9e965ccc5c1d372461dd124ec1712 (HEAD -&gt; main)
Author: Jose Alves &lt;jose.tapadas@gmail.com&gt;
Date:   Thu Mar 7 22:05:58 2024 +0000

    new bio version

commit 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
Author: Jose Alves &lt;jose.tapadas@gmail.com&gt;
Date:   Thu Mar 7 18:12:37 2024 +0000

    initial commit for the git internals repo
</code></pre><p>We can see that the initial commit has a <code>nil</code> parent:</p><pre><code>jose@syn:~/src/gitint$ git show --format=&quot;%P&quot; 8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55 -s
</code></pre><p>But our subsequent commit, currently at <code>HEAD</code>, showcases its parent(s):</p><pre><code>jose@syn:~/src/gitint$ git show --format=&quot;%P&quot; d1f46f80cea9e965ccc5c1d372461dd124ec1712 -s
8ba8dc9fd296abc38f63d9d598f3ed450c9cdb55
</code></pre><p>Two further trivial details worth noting:</p><ul><li>the exact same changes on a specific file, even when made by multiple authors, will still get the same blob and SHA-1 hash value</li><li>the commit hash, however, will always be different, as not only the blob info but also the metadata associated with the commit (author data, time, ...) contributes to generating a new hash</li></ul><h3 id="wrap-up">wrap up</h3><p>This was a simple exposure of the three basic root objects in the git representation of a filesystem evolving through time: each &quot;snapshot&quot; (commit) points to a linked structure of <code>trees</code> that reference other <code>trees</code> or the <code>blobs</code> for the files that are included, all identified by the SHA-1 hashes that trace every change made to them.</p>]]></content:encoded></item><item><title><![CDATA[Practical Asynchronous Iteration in JavaScript]]></title><description><![CDATA[<p>With the introduction of ES6, we acquired the support for synchronously iterating over data. 
We could, of course, already iterate over iterable<em> </em>built-in structures like objects or arrays, but the big introduction was the formalization of an implementable interface to create both our iterables<strong> </strong>and generators.</p><p>But what about the</p>]]></description><link>https://jose.tapadas.dev/practical-asynchronous-iteration-in-javascript/</link><guid isPermaLink="false">65d7644c8600432538ab255b</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Thu, 22 Feb 2024 15:18:37 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1513579068076-168caae41e27?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHN1cmYlMjBsaW5ldXB8ZW58MHx8fHwxNzA4NjE1MTk1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1513579068076-168caae41e27?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHN1cmYlMjBsaW5ldXB8ZW58MHx8fHwxNzA4NjE1MTk1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Practical Asynchronous Iteration in JavaScript"><p>With the introduction of ES6, we acquired the support for synchronously iterating over data. 
We could, of course, already iterate over iterable built-in structures like objects or arrays, but the big introduction was the formalization of an implementable interface to create both our iterables and generators.</p><p>But what about the scenarios where our iterations are done over data that is obtained from an asynchronous source, such as a set of remote HTTP calls or reading from a file?</p><p>In this article, we will conduct a practical analysis of the <a href="https://tc39.es/proposal-async-iteration/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">&#x201C;Asynchronous Iteration&#x201D;</a> proposal, which is intended to add:</p><blockquote>&#x201C;support for asynchronous iteration using the AsyncIterable and AsyncIterator protocols. It introduces a new IterationStatement, <strong>for-await-of</strong>, and adds syntax for creating async generator functions and methods.&#x201D;</blockquote><p>For this guide, only basic knowledge of JavaScript (or programming in general) is required. All our examples, which will be presented with some simple TypeScript annotations, can be found in <a href="https://github.com/josetapadas/async_iterators/blob/master/src/index.ts?ref=jose.tapadas.dev" rel="noopener ugc nofollow">this GitHub repository</a>.</p><h1 id="recap-of-synchronous-iterators-and-generators">Recap of Synchronous Iterators and Generators</h1><p>In this section, we will do a quick review of synchronous iterators and generators in JavaScript so we can more easily extrapolate them into the asynchronous case.</p><h2 id="what-is-an-iteration-anyway">What is an iteration anyway?</h2><p>Let us look into this question by considering what we usually have at hand when iterating. For example, on one side, we have our arrays, strings, etc., which are basically our sources. 
On the other side, we have what we usually use as our means to consume our data via an iteration &#x2014; namely our <code>for</code> loops, spread operators, etc.</p><p>Based on this, we can look at an iteration as a protocol that, when implemented by our sources, will allow consumers to sequentially &#x201C;consume&#x201D; its contents using a set of regular operations. This protocol could then be represented by the following interface:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*WOaQQRQL-peefmTnXyjUEQ.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="523"><figcaption><span style="white-space: pre-wrap;">1.1 Interface for defining a synchronous iterable, iterator, and subsequent result for every iteration.</span></figcaption></figure><p>So, putting it verbosely for those readers who may not be familiar with TS interface descriptions:</p><ul><li>A <code>SynchronousIterable</code> provides a method via a <code>Symbol.iterator</code> that would return a <code>SynchronousIterator</code>.</li><li>Our <code>SynchronousIterator</code> would then return our <code>IteratorResults</code> from its implementation of the <code>.next()</code> method.</li><li>The <code>IteratorResults</code> would then contain a value to hold the current iterated value as well as a <code>done</code> flag that is set to <code>true</code> after the last item is iterated through (and false while iterating).</li></ul><p><em>Note: You can find out more about this by reading the </em><a href="https://tc39.es/ecma262/?ref=jose.tapadas.dev#sec-iteration" rel="noopener ugc nofollow"><em>ECMAScript 2021 Language Specification documentation</em></a><em>.</em></p><p>An example of using this interface can be easily showcased by manually iterating over an array:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img 
src="https://miro.medium.com/v2/resize:fit:700/1*_83kTRsFEiQOppdpaLz1LQ.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="276"><figcaption><span style="white-space: pre-wrap;">1.2 Simple manual iteration. Notice the fact that &#x201C;done&#x201D; is set to true when we are over-transversing the object.</span></figcaption></figure><p>Sources that implement this interface can also be iterated via a <code>for..of</code> iteration directive that you have probably made use of at some point:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:462/1*WnctW7b1TOYDAiRQlklTsw.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="462" height="276"><figcaption><span style="white-space: pre-wrap;">1.3 Using for..of to iterate over our source.</span></figcaption></figure><p>Of course, we would expect sources like the built-in array (as used above) to be iterable naturally. To then showcase it differently, we&#x2019;ll implement that interface, for example, to generate a range of numbers:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*vSTuukAQ6tEcn4_LgtOgbA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="624"><figcaption><span style="white-space: pre-wrap;">1.4 A custom implementation of an iterable source that generates a range of numbers from &#x201C;start&#x201D; to &#x201C;end.&#x201D;</span></figcaption></figure><h2 id="what-about-generators">What about generators?</h2><p>Usually, functions return either a single value or none. We can think of generators<strong> </strong>as entities that can return, in sequence, multiple values. 
This handing over of values is what we call yielding.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:479/1*dwW78X2V61yXSTJknoDwTg.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="479" height="276"><figcaption><span style="white-space: pre-wrap;">1.5 Defining a generator function that yields, one at a time, the same 1, 2, 3 sequence.</span></figcaption></figure><p>These generator functions do not behave as regular functions, as they are lazily evaluated. When called, they return a generator object that is responsible for managing their execution. These generators are also iterable, meaning they also implement our interface, so we can loop over them just as above:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*4psE_Mw2FGeheZYa9DL9Xg.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="320"><figcaption><span style="white-space: pre-wrap;">1.6. Checking the values from our generator function. And yep, the main method of a generator is also .next().</span></figcaption></figure><p>It is also important to note that the <code>.next()</code> function is how we obtain the next yielded value from these generator objects. 
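</p><p>Mirroring Figures 1.5 and 1.6 above in copyable form (a minimal sketch; the generator name is illustrative, not from the article&#x2019;s repo):</p>

```typescript
// A generator function: lazily yields the sequence 1, 2, 3.
function* oneTwoThree(): Generator<number> {
  yield 1;
  yield 2;
  yield 3;
}

// Calling it returns a generator object; .next() drives its execution.
const gen = oneTwoThree();
console.log(gen.next()); // { value: 1, done: false }
console.log(gen.next()); // { value: 2, done: false }
console.log(gen.next()); // { value: 3, done: false }
console.log(gen.next().done); // true
```

<p>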
This will then produce the expected outputs.</p><p>As we can now yield values (instead of managing the iteration state manually), we can re-implement our iterable range from Figure 1.4, but now using a generator function:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*RWvfD7qO8vfwQmxsU07a-Q.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="424"><figcaption><span style="white-space: pre-wrap;">1.7 Our ranged iterable source now implemented using a generator function.</span></figcaption></figure><p>As we can see, we can now leverage a generator function to simplify our original ranged iterable.</p><h1 id="an-asynchronous-interface-proposal">An Asynchronous Interface Proposal</h1><p>After grasping the idea behind the interface definition of an iterable (Figure 1.1), it is easy to extend it in a way where every step of our iteration returns the result of an asynchronous operation. The usual representation of this is via promises:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*RAoZvdJOCjHVOh43dfaoWA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="478"><figcaption><span style="white-space: pre-wrap;">2.1. An interface for asynchronous iterables.</span></figcaption></figure><p>From the definition above, we can easily identify that the asynchronous operation happens when <code>.next()</code> provides the next element of the iteration. Therefore, it is trivial to implement it in a way that handles the results as promises. 
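</p><p>As a minimal text sketch of such an implementation (the <code>delay</code> helper and all names here are illustrative assumptions, not the article&#x2019;s exact code):</p>

```typescript
// Illustrative helper: a promise that resolves after ms milliseconds.
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// An asynchronous range: each .next() returns a promise of the next result.
function asyncRange(start: number, end: number): AsyncIterable<number> {
  return {
    [Symbol.asyncIterator]() {
      let current = start;
      return {
        async next(): Promise<IteratorResult<number>> {
          await delay(5); // faux asynchronous work
          return current <= end
            ? { value: current++, done: false }
            : { value: undefined, done: true };
        },
      };
    },
  };
}

// Consumed with for await..of:
async function collectRange(): Promise<number[]> {
  const seen: number[] = [];
  for await (const n of asyncRange(1, 3)) {
    seen.push(n);
  }
  return seen; // resolves to [1, 2, 3]
}
```

<p>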
Let us make this clear by adapting our ranged iterator (from Figure 1.4) in this way with a faux delay:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*MGFEVEVqDcmYEVSCsqX3Ag.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="649"><figcaption><span style="white-space: pre-wrap;">2.2 An asynchronous ranged iteration with for&#x2026;of.</span></figcaption></figure><p>We can use this concept to abstract our generator in (1.7). It would then be trivial to implement the same range using an asynchronous generator:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*lhNDd7-SNXEbp2PiHG-t8A.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="529"><figcaption><span style="white-space: pre-wrap;">2.3 Implementation of an asynchronous generator</span></figcaption></figure><p>Regardless of how we generate the data, we may simply iterate over the elements as if our asynchronous source was yet another iterable. Actually, it is one, as long as it implements our interface.</p><p>To demonstrate this, the next section will use an asynchronous source (a Hacker News top stories feed) that we will then manipulate as we would any other iterable structure.</p><h1 id="practical-exercise-a-hackernews-iterable"><strong>Practical Exercise: a HackerNews Iterable</strong></h1><p>Based on the implementation of an asynchronous generator (2.3), it is now trivial to materialize a generator source for our posts. 
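</p><p>Generically, such a source just hides its fetches behind an async generator; a sketch with an injected <code>fetchItem</code> function standing in for any real HTTP call (all names here are illustrative assumptions):</p>

```typescript
// Yields up to `limit` items, fetching each one asynchronously.
async function* remoteItems<T>(
  ids: number[],
  fetchItem: (id: number) => Promise<T>,
  limit: number
): AsyncGenerator<T> {
  for (const id of ids.slice(0, limit)) {
    yield await fetchItem(id);
  }
}

// Usage with a stubbed fetcher (a real one would perform an HTTP request):
async function demo(): Promise<string[]> {
  const fakeFetch = async (id: number) => `story #${id}`;
  const titles: string[] = [];
  for await (const title of remoteItems([10, 20, 30, 40], fakeFetch, 2)) {
    titles.push(title);
  }
  return titles; // resolves to ["story #10", "story #20"]
}
```

<p>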
For this example, we will use the <a href="https://github.com/HackerNews/API?ref=jose.tapadas.dev" rel="noopener ugc nofollow">HN API</a>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*y9zjdE5ZsR9cqW6sFY0Z5A.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="546"><figcaption><span style="white-space: pre-wrap;">3.1 Async generator for the top stories on HN.</span></figcaption></figure><p>This code simply tucks the asynchronous logic away while still implementing the usual interface of an async iterator. For this example, we are limiting the number of entries to iterate over, which is optional. By yielding each iterated value, we can easily consume this source with the following simple implementation:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*EcTH3VFav1fCT4KGG3hrrA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="317"><figcaption><span style="white-space: pre-wrap;">3.2 Iterating our asynchronous news source.</span></figcaption></figure><p>As expected, this loops over and renders a list of stories from our source:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*XQ07YPoCO0GvmrDCVY9cPA.png" class="kg-image" alt="Practical Asynchronous Iteration in JavaScript" loading="lazy" width="700" height="312"><figcaption><span style="white-space: pre-wrap;">3.3 Iteration result from our HN generator.</span></figcaption></figure><p>We can now look at our data source and handle it as a simple data sequence, keeping the full asynchronous fetching and manipulation logic within our generator definition.</p><h1 id="summary">Summary</h1><p>Hopefully, this article showcases that it is trivial to observe and handle our asynchronous data sources as iterables by applying simple 
language formalities already available in the language specification.</p><p>By generalizing the synchronous case to also cover asynchronous generation, we can now iterate over any iterable source regardless of the nature of the data source &#x2014; as long as it implements our interface. Looking at our asynchronous data sources as iterables opens a creative potential for our ideas towards more idiomatic and eloquent codebases.</p><p>For all the code examples, please refer to this <a href="https://github.com/josetapadas/async_iterators?ref=jose.tapadas.dev" rel="noopener ugc nofollow">GitHub repo</a>.</p>]]></content:encoded></item><item><title><![CDATA[A gentle introduction to gradient descent thru linear regression]]></title><description><![CDATA[<p>To have a glimpse on the intricate nature of the reality that surrounds us, to understand the underlying relations between events or subjects, or even to assess the influence of a specific phenomenon on an arbitrary event, we must convolute reality into the information dimension attempting to leverage our human</p>]]></description><link>https://jose.tapadas.dev/a-gentle-introduction-to-gradient-descent-thru-linear-regression/</link><guid isPermaLink="false">65d76657c7a18807246fefe6</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Mon, 01 May 2023 15:21:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1524351543168-8e38787614e9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fHZvcnRleHxlbnwwfHx8fDE3MDg2MTUyODd8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1524351543168-8e38787614e9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fHZvcnRleHxlbnwwfHx8fDE3MDg2MTUyODd8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="A gentle introduction to gradient descent thru linear regression"><p>To have a glimpse on the intricate 
nature of the reality that surrounds us, to understand the underlying relations between events or subjects, or even to assess the influence of a specific phenomenon on an arbitrary event, we must convolute reality into the information dimension, attempting to leverage our human abstractions to somehow grasp some insights on the nature of what actually surrounds us.</p><p>In this article we will briefly analyze one simple statistical tool that allows us to model a slice of reality: based on a set of <strong>Observations</strong>, we leverage a set of <strong>Variables</strong> to generate a model that predicts or forecasts a behavior by inferring the relations and mutual influence of those variables. This statistical tool is called <strong>Linear Regression</strong>.</p><p>To help minimize the errors associated with the prediction model, optimizing it to better represent reality, we will also briefly show a simple application of the <strong>Gradient Descent</strong> optimization algorithm.</p><p>This article assumes only basic algebra and calculus knowledge; both topics are simple but nevertheless represent two foundational subjects of modern statistics and machine learning.</p><h1 id="scenario">Scenario</h1><p>Let us consider a simple example by sampling some values from the <a href="https://stackoverflow.blog/2018/09/05/developer-salaries-in-2018-updating-the-stack-overflow-salary-calculator/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">&#x201C;Developers Salaries in 2018&#x201D;</a> article from StackOverflow. 
Below we can see a small figure representing the average salary points in Germany, and how they evolve as the developer experience advances in years:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*5G2eI7OduGRgI7u-m8Q40Q.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 1) Median yearly salaries for developers, in thousands of euros, by experience in Germany (2018)</span></figcaption></figure><p>Based on the above data points, we would like to develop a simple model function that would allow us to predict how the salaries evolve at any given amount of experience.</p><h2 id="linear-regression">Linear regression</h2><p>A linear function can then be defined by the simple expression:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:171/0*zrh3JctOBl2tIC8-" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="171" height="24"></figure><p>With the constant <em>m</em> representing the slope of the function line and <em>b</em> usually referred to as the intercept. 
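</p><p>As a tiny sketch of this expression in code (the slope and intercept values are illustrative):</p>

```typescript
// y = m * x + b
const linear = (m: number, b: number) => (x: number): number => m * x + b;

// Hypothetical model: salary = 3 * experience + 35
const predictSalary = linear(3, 35);
console.log(predictSalary(10)); // 65
```

<p>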
Some examples can be seen below showcasing different values of <em>(m, b)</em>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*p204GeqDZvirikOMS81JZw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 2) Three examples of slopes and intercepts for a linear function</span></figcaption></figure><p>In the example above we can see that changing <em>m</em> influences the slope of the resulting line, while the intercept <em>b</em> modifies the function value when crossing <em>x = 0</em>.</p><p>In our current scenario the salaries do not evolve in an exactly linear progression; nevertheless, looking at the data shape in Figure 1, it is acceptable to approximate the predicted salaries by <em>fitting a line</em> through the points in a way that:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:275/0*8s76g_PxeS565jvF" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="275" height="24"></figure><p>Finally, our model would be the line that approximates the joint evolution of our parameters, <strong>salary</strong> and <strong>experience</strong>, obtained by tweaking <em>m</em> and <em>b</em> so as to obtain something like:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*PlMy_FRzK0seVBqHRyi-0w.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 3) Possible linear model used to predict the median salaries</span></figcaption></figure><p>To this &#x201C;linear&#x201D; 
representation of a basic model capturing the relationship between our parameters we give the name <strong>Linear Regression</strong>.</p><h2 id="the-cost-of-our-errors">The Cost of our Errors</h2><p>Now that we&#x2019;ve identified what shape we&#x2019;ll be using to generate our model, we must figure out which values of <strong><em>(m, b)</em></strong> best describe the evolution of our data.</p><p>But how can we choose the values of <em>m</em> and <em>b</em> that generate the line we are searching for? An approach would be to compute the errors between the values our model generates and the actual data we have in place.</p><p>A simplistic approach for representation purposes only could be:</p><ul><li>we know that a developer with around <em><strong>10</strong> years of experience</em> earns around <em><strong>72K</strong> euros of yearly salary</em> (1)</li><li>starting with an example slope and intercept of <em>(m, b) = (3, 35)</em></li></ul><p>For this specific data point of 10 years of experience <em>(x=10)</em>, a sample error <strong>E</strong> would be:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:258/0*2ouOeqCJnU4mSMWd" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="258" height="32"></figure><p>For all of our existing sample data points, we can then compute a total error that sums all the differences between our prediction and the real values, resulting in a function as such:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:306/0*eAY_u23T4a9j4PAf" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="306" height="65"></figure><p>with:</p><ul><li><strong><em>n</em></strong> being the total number of samples in our data set</li><li><strong><em>y</em></strong> being the 
actual salary value for a specific observation</li><li><strong><em>x</em></strong> being the number of experience years that we want to predict for <strong><em>y</em></strong></li></ul><p>To this sum of the squared differences between each observation and our prediction, which represents our <strong>Error Function</strong> (or <strong>Cost Function</strong>), we give the name <strong>Sum of Squared Errors (SSE)</strong>. In statistics, this squared error is very useful to assess the &#x201C;quality&#x201D; of our prediction values against real observations.</p><p>But how can we then proceed in finding the proper values for <em>m</em> and <em>b</em>? An intuitive approach, as we are now able to compute an <strong>Error Function</strong>, is to find the pair <em>(m, b)</em> that minimizes this function. If so, we can then clearly state that we have the prediction that produces the minimal error and therefore most closely represents reality.</p><p>Let us then choose two random values for <em>(m, b)</em>, compute the cost function and then change these values to try to find the minimum of our error function. 
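</p><p>The cost function above can be sketched in code as follows (the sample observations below are illustrative, not the article&#x2019;s exact data):</p>

```typescript
// Sum of Squared Errors: E(m, b) = sum over i of (y_i - (m * x_i + b))^2
type Observation = { x: number; y: number };

function sse(points: Observation[], m: number, b: number): number {
  return points.reduce((acc, p) => acc + (p.y - (m * p.x + b)) ** 2, 0);
}

// Illustrative observations (experience in years vs. salary in thousands):
const sample: Observation[] = [
  { x: 2, y: 40 },
  { x: 5, y: 48 },
  { x: 10, y: 62 },
];

console.log(sse(sample, 3, 0));  // a large cost with b = 0
console.log(sse(sample, 3, 30)); // a much lower cost with a better intercept
```

<p>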
Considering initially <em>(m, b) = (3, 0)</em> and our data points from Figure 1, we obtain the following graphical result:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*JiNr8jcC7D0phualjRC3xA.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 4) Initial prediction and errors for (m, b) = (3, 0)</span></figcaption></figure><p>From the above figure we can see:</p><ul><li>the green dots representing our <strong>observed data values</strong> for the salaries</li><li>the blue line representing our <strong>prediction model</strong> (<em>y = 3 * years of experience + 0</em>)</li><li>the dotted red lines representing our <strong>Error</strong> for the current parameters <em>(m, b)</em></li></ul><p>For this specific set of intercept and slope, let us now compute our accumulated cost for the existing observations:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:580/0*Is3SrR28HdvRoyoQ" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="580" height="47"></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:408/0*nqrg_Sd47BTrSCit" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="408" height="47"></figure><p>Let us now fix the slope value to 3 but increase our intercept to 20, <em>(m, b) = (3, 20)</em>. 
We&#x2019;ll obtain the following representation:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*yKgqRi4DRrlLNQLEAm-KAg.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 5) Prediction iteration 1, and errors for (m, b) = (3, 20)</span></figcaption></figure><p>With an associated cost of <strong><em>E = 232.5</em></strong>. We can clearly see that by updating our intercept, we have improved our prediction as the error has dropped dramatically. Let us now plot multiple scenarios for different values of the intercept:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*XmjRRjxz7rd0BFSLQEosyw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="521"><figcaption><span style="white-space: pre-wrap;">Figure 6) Computing the errors by varying the value of the intercept</span></figcaption></figure><p>As we can see from the figure above, as we increase the value of the intercept <em>b</em> we can also observe the changes in the cost function. 
In this specific example it is trivial to identify the pink line, with <em>(m, b) = (3, 30)</em>, as the most accurate prediction of our observed values, as it also has the lowest cost value.</p><p>By plotting the variation of the error cost-function, obtained by varying the <em>intercept</em> value, we are presented with the following figure:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*LDb9GkxOlEF6cnNzK0ekZQ.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 7) Evolution of the cost function when changing the intercept value</span></figcaption></figure><p>We can clearly see that, when varying the value of <em>b</em>, taking into account that our Error Function is convex, we are able to find a local minimum that represents the minimal error of our prediction model. In this simple demonstration it is clear that the intercept <em>b</em> that minimizes our error is somewhere between [30, 40]. Unfortunately, simply iterating with a pre-defined step in order to find this minimum is very expensive and time consuming.</p><p>But how can we then compute this minimum of our cost function more cleverly? We&#x2019;ll be using the <strong>Gradient Descent</strong> algorithm.</p><h2 id="gradient-descent">Gradient Descent</h2><p>Gradient descent is an iterative optimization algorithm that allows us to find the local minima of a specific function.</p><p>A very nice example to explain the logic behind this algorithm, and recurrent in the literature, is that of the <strong>blind alpinist</strong>. 
Let us imagine that a blind alpinist wants to climb to the exact top of the mountain, with the least number of steps possible:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*jfIsd1WuxidQ4qoBxvmIMQ.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 9) Sequence of steps a smart blind alpinist would take to climb a mountain</span></figcaption></figure><p>As the alpinist is blind, he will assess the inclination at his current position in order to choose the magnitude of the next step he should take:</p><ul><li>if the inclination (slope) of the mountain at his current position is high, he can safely take a big step <em>(as we can notice for example on the transition from the 1st step to the 2nd)</em></li><li>when the slope is getting smaller, as he reaches the top of the mountain, he knows he needs to take smaller steps in order to get approximately to the exact highest point (<em>as the alpinist is getting closer to the top, from the 6th to the 7th step he is more careful on how to advance his position)</em></li><li>for positive slopes he needs to keep on going upwards to the top</li><li>for negative slopes, he is going down, so he needs to step back towards the top</li></ul><p>The slope of the &#x201C;mountain&#x201D; at a given point is then given by the derivative of the function at that point:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*qcq33VVWaDwzLz57pYjLpg.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 10) Slope of the function at a given point</span></figcaption></figure><p>Therefore, by computing the derivative of our 
&#x201C;mountain-function&#x201D; at a certain point we can then infer the nature of the step we&#x2019;ll need in order to properly reach our target, stopping once we reach a slope close to 0 (the yellow slope at the top, compared to the blue slope value at the beginning of the mountain).</p><p>It is also trivial to understand that the same is valid for the convex version of this function, by switching the alpinist challenge from reaching the top of a mountain to actually reaching the bottom of a valley:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*ahWt8YqQnsAOtBmiRswGZw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 11) Iterative finding of our local minimum for a convex function (or a valley in this alpine example)</span></figcaption></figure><p>Going back to our case study, and taking into account that we want to properly estimate the values of the slope and the intercept that minimize the Cost, we can then use this concept to minimize the convex function that is in fact our Cost Function.</p><p>For simplicity, let us initially just try to predict the actual value of the intercept by still keeping our slope fixed at <em>m = 3</em>. 
Our cost function would then be:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:624/0*SGTZhI6F84QChNwG" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="624" height="120"></figure><p>As we now know the equation of this curve, we can take its derivative and determine its slope at any value of the <em>intercept</em>.</p><p>Let us now compute the derivative of our cost function with respect to the intercept, using the chain rule:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:625/0*XOKmg3F0icFnX_gz" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="625" height="157"></figure><p>Now that we have properly computed the derivative, we can use gradient descent to find where our Cost Function has its local minimum.</p><p>It would indeed be trivial to compute this specific minimum by finding the place where the derivative (slope) is <strong><em>dE(b)/db = 0</em></strong>. Nevertheless, solving this analytically is not possible in many computational problems. Therefore we will apply Gradient Descent to, starting from an initial guess, iteratively approach this minimum. 
This ability to proceed when we cannot solve for the minimum directly is in fact what makes this optimization algorithm so useful in so many contexts, such as modern machine learning problems.</p><h2 id="learning-the-proper-value">Learning the proper value</h2><p>Now that we have our derivative function, let us first compute the slope for a random value of the intercept <strong><em>b</em></strong>, such as:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:135/0*v876htta1Dp3ai4e" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="135" height="58"></figure><p>With this we know that, when the intercept is 0, the slope of the tangent line at this point on our cost function is <strong>-69</strong>. As we approach the minimum of the function, this slope will also get close to 0.</p><p>From our alpine example we understood that the size of the step we take should somehow be related to the slope at a given point. This has the <strong>objective of taking &#x201C;bigger&#x201D; steps when the slope is higher and we are far from the minimum, and taking &#x201C;smaller&#x201D; steps when we are getting closer to a null slope</strong>.</p><p>As we are doing this process iteratively, let us adjust the step size on each iteration by scaling the slope with a constant. This constant, which we use to adapt the step size, is called the <strong>learning rate</strong>. 
With this idea in mind we can define the following expression to generate and adapt our step size on every iteration:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:413/0*MX-N6rz4bALaR_C_" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="413" height="29"></figure><p>Assuming a <strong>learning rate</strong> of 0.2, we obtain the following Step Size:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:359/0*pTxYjDtsLkLf1pNU" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="359" height="29"></figure><p>Taking into account our new step size, we can then compute the next iteration&#x2019;s <em>intercept</em> by subtracting the step size from the current one:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:273/0*fk5ZGBI9HLfVzUoq" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="273" height="29"></figure><p>So for our first iteration we have:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:254/0*xXFSJ3KpooWggllI" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="254" height="29"></figure><p>For this new intercept value we can see that the slope of the error function is then given by:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:161/0*EJSq0psKUE-wCNpQ" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="161" height="50"></figure><p>As the slope is closer to 0 we can see that we moved closer to the optimal value after just one iteration. 
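</p><p>The iteration scheme just described (step size = slope times the learning rate; new intercept = current intercept minus the step size) can be sketched in a few lines of Python. Note that the derivative function below is illustrative: it was chosen to reproduce the slopes quoted above (-69 at b = 0, -41.4 at b = 13.8), and is not taken from the original script.</p>

```python
# Illustrative derivative of the cost function with respect to the
# intercept b; chosen to match the slopes quoted in the text
# (dE(0)/db = -69, dE(13.8)/db = -41.4), not the author's original script.
def dEdb(b):
    return 2 * b - 69

def descend_intercept(b=0.0, learning_rate=0.2, min_step=0.001, max_iterations=1000):
    for _ in range(max_iterations):
        step = dEdb(b) * learning_rate   # step size = slope * learning rate
        if abs(step) < min_step:         # stop once the steps become tiny
            break
        b = b - step                     # move against the slope
    return b

b = descend_intercept()  # approaches 34.5, where this illustrative dE/db is 0
```

<p>With a clean update rule like this, the iterations keep shrinking the step until it falls below a chosen threshold.</p><p>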
By revisiting figure 6 we can infer that by increasing the intercept from 0 to a bigger value we are indeed reducing the residual error between our estimates and the actual observations.</p><p>Doing a couple more iterations we obtain:</p><p><strong>Step Size(2)</strong> = -41.4 * 0.2 = -8.28<br><strong>b(2)</strong> = 13.8 - (-8.28) = 22.08<br><strong>dE(22.08)/db</strong> = -24.8<br><strong>Step Size(3)</strong> = -24.8 * 0.2 = -4.96<br><strong>b(3) </strong>= 22.08 - (-4.96) = 27.04<br><strong>dE(27.04)/db</strong> = -14.8<br><strong>Step Size(4)</strong> = -14.8 * 0.2 = -2.96<br><strong>b(4)</strong> = 27.04 - (-2.96) = 30<br><strong>dE(30)/db</strong> = -9</p><p>We can verify the following from these 3 iterations:</p><ul><li>with every step we reach a smaller absolute slope</li><li>as we approach a 0 slope we take smaller steps, while keeping the same learning rate</li></ul><p>And visually we can see that on each iteration we get closer to a good prediction line for our data-set, and the steps are actually getting smaller:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*35pTtGAPX4tukA8eg3OvWw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="508"><figcaption><span style="white-space: pre-wrap;">Figure 13) Applying the iteration values from the gradient descent for the intercept</span></figcaption></figure><p>In order to stop the iterations at an acceptable value, one should:</p><ul><li>decide on a minimal step size per iteration, e.g., stop if the step size is smaller than 0.001</li><li>stop when we reach a certain number of iterations</li></ul><p>By applying these rules we can verify that the algorithm stops a few iterations later:</p><p><strong>[+] Iteration 5:</strong><br><strong>Step Size</strong> = -0.592<br><strong>b = </strong>30.592<br><strong>dE/db</strong> = -7.8160000000000025<br><strong>[+] 
Iteration 6:</strong><br><strong>Step Size</strong> = -0.1184<br><strong>b = </strong>30.7104<br><strong>dE/db</strong> = -7.579200000000007<br><em>(...)</em><br><strong>[+] Iteration 9:</strong><br><strong>Step Size</strong> = -0.0009472000000000001<br><strong>b = </strong>30.7397632<br><strong>dE/db</strong> = -7.5204736000000025</p><p>Stabilizing with an intercept of <strong><em>b = 30.7397632</em></strong>. When plotted we obtain:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*GQofWpfbV9a_ULIiH-glyw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 14) Using the stabilized predicted value given for a fixed slope to our intercept</span></figcaption></figure><p>With this approach we could verify that, by iterating progressively (adapting the step size based on a learning rate), we could indeed approach the minimum of the error cost function, obtaining, as plotted, a much closer model representation of the data. This was done by simply trying to predict one of the parameters, the intercept. 
In the following section we will then try to understand the evolution of this model with both our variables.</p><h2 id="moving-into-the-new-dimension">Moving into the new dimension</h2><p>Now that we have learned how to estimate the intercept value for our model, let us move a step outside our one dimension and apply gradient descent to both the <strong>intercept</strong> and the <strong>slope</strong>.</p><p>First, as in the previous section, we compute the partial derivative of our cost function with respect to our <strong>intercept</strong>, using the chain rule:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:646/0*CLldvN0DtJemPWTj" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="646" height="163"></figure><p>Now we can proceed to find the partial derivative of our Error function with respect to the slope:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:647/0*wjkpXBFceA9cw07f" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="647" height="240"></figure><p>The set of partial derivatives with respect to all the dimensions of this function is called the <strong>Gradient:</strong></p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:211/0*tNQPg2LZvV9zMJ6y" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="211" height="60"></figure><p>We will then use this gradient, just as in the previous section, to find the local minimum of our error function. This is the reason behind calling this algorithm <strong>Gradient Descent</strong>.</p><p>In order to do so, we need to extrapolate what we did for the intercept in the previous section to predict both values, adjusting for their inter-dependency. 
Exposing it as such:</p><ul><li>As we are now handling two variables, the problem could again be compared to climbing a mountain, but with an extra dimension of complexity: you would need to adapt the pace of your feet on the wall, but also a distinct pace for your hand grips. Therefore, keeping the <strong>same learning rate</strong>, we would need to adapt two step sizes, one for the slope and another for the intercept:</li></ul><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:405/0*ZfWiouTAKKwQDT3T" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="405" height="91"></figure><ul><li>With the new step sizes we can then obtain the current prediction for both variables, per iteration <em>n</em>:</li></ul><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:511/0*7lWXbahpsGR9rHdZ" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="511" height="63"></figure><ul><li>We then compute the gradient again (namely the derivative for both variables, with the updated values):</li></ul><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:453/0*IsJoGuHzK1FahqF2" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="453" height="63"></figure><ul><li>Repeat the whole process until we reach a chosen limit for the iterative process. We will keep on using a limit on the step size.</li></ul><p>In order to implement this small algorithm we also need to pick all the initial values. A proper stopping limit should also be decided. 
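</p><p>The loop in the steps above can be sketched in plain Python, using the partial derivatives of the mean squared error cost derived earlier. The helper names and the toy data below are illustrative, not the ones from the author&#x2019;s own script:</p>

```python
# Sketch of gradient descent over both parameters of y = m*x + b, using the
# partial derivatives of the mean squared error cost
# E(m, b) = (1/n) * sum((y - (m*x + b))**2). Names and data are illustrative.
def gradient(m, b, xs, ys):
    n = len(xs)
    dm = (-2.0 / n) * sum(x * (y - (m * x + b)) for x, y in zip(xs, ys))
    db = (-2.0 / n) * sum(y - (m * x + b) for x, y in zip(xs, ys))
    return dm, db

def gradient_descent(xs, ys, m, b, learning_rate, min_step, max_iterations=200000):
    for _ in range(max_iterations):
        dm, db = gradient(m, b, xs, ys)
        step_m = dm * learning_rate
        step_b = db * learning_rate
        if max(abs(step_m), abs(step_b)) < min_step:   # stopping criterion
            break
        m, b = m - step_m, b - step_b                  # update both at once
    return m, b

# Toy data lying exactly on y = 2x + 5; the fit should approach m = 2, b = 5
xs, ys = [1, 2, 3, 4], [7, 9, 11, 13]
m, b = gradient_descent(xs, ys, m=3, b=30, learning_rate=0.001, min_step=0.00001)
```

<p>On exact data lying on a line, the fit approaches the true slope and intercept; on noisy data it approaches the least-squares estimates.</p><p>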
Let us, for our example, decide on:</p><ul><li>the initial <strong>intercept</strong> as <strong><em>b = 30</em></strong></li><li>the initial <strong>slope </strong>as <strong><em>m = 3</em></strong></li><li>our <strong>learning rate</strong> will be <strong><em>0.001</em></strong></li><li>and we will stop when the step size reaches <strong>0.00001</strong></li></ul><p>This can be represented by this simplistic python script:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/3cee9e9fab6833248c9ded5f8e68b4c5" allowfullscreen frameborder="0" height="722" width="680" title="gradient_descent_simple_regression.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Running this simple script we obtain the following output:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9108a4d4cbd90c8249fad94a12652bee" allowfullscreen frameborder="0" height="370" width="680" title="Simple regression result" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>We then obtain the following predicted values for both our variables:</p><ul><li><strong>Slope (m) = 2.5101</strong></li><li><strong>Intercept (b) = 40.5478</strong></li></ul><p>Applying these values to our prior plots containing the error spread we obtain the following:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*F8sT3KpwZgBvbXt1GzGfEw.png" class="kg-image" alt="A gentle introduction to gradient descent thru linear regression" loading="lazy" width="700" height="525"><figcaption><span style="white-space: pre-wrap;">Figure 15) Using the stabilized predicted value, based on the computed slope and intercept, compared with the initial value</span></figcaption></figure><p>We can see from the figure above that our (linear) prediction is now much closer to the modeled data, and we can use our new model to actually infer about the relation 
between these two parameters.</p><h1 id="conclusion">Conclusion</h1><p>We can try to make an initial prediction about the natural relation of a set of parameters, and even obtain a simplistic model of their evolution, by using <strong>Linear Regression</strong>. In this article we used this simple numerical method as a way to clearly expose the basic functioning of the <strong>Gradient Descent</strong> algorithm and mainly how we <strong>can achieve an iterative optimization of a prediction by trying to minimize a convex error function</strong>.</p><p>Even though its underlying concepts may seem very simple, both conceptually and mathematically, they serve as one of the foundations of deep learning and neural networks.</p>]]></content:encoded></item><item><title><![CDATA[Predicting AirBnB prices in Lisbon: Trees and Random Forests]]></title><description><![CDATA[<p>In this small article, we will quickly bootstrap a prediction model for the nightly prices of an AirBnB in Lisbon. 
This guide hopes to serve as a simplistic and practical introduction to machine learning data analysis, by using real data and developing a real model.</p><p>It assumes as well a</p>]]></description><link>https://jose.tapadas.dev/predicting-airbnb-prices-in-lisbon-trees-and-random-forests/</link><guid isPermaLink="false">65d766d9c7a18807246feff7</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Mon, 16 Mar 2020 15:23:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1585208798174-6cedd86e019a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fGxpc2JvbnxlbnwwfHx8fDE3MDg2MTU0NjB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1585208798174-6cedd86e019a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fGxpc2JvbnxlbnwwfHx8fDE3MDg2MTU0NjB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests"><p>In this small article, we will quickly bootstrap a prediction model for the nightly prices of an AirBnB in Lisbon. This guide hopes to serve as a simplistic and practical introduction to machine learning data analysis, by using real data and developing a real model.</p><p>It assumes as well a basic understanding of Python and the machine learning library <a href="https://scikit-learn.org/stable/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">scikit-learn</a>, and it was written on a Jupyter notebook running Python 3.6 and sklearn 0.21. 
The dataset, as well as the notebook, can be obtained on my <a href="https://github.com/josetapadas/airbnb-lisbon-model-trees?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Github account</a>, or via <a href="https://datasetsearch.research.google.com/search?query=lisbon+airbnb&amp;docid=c6zMqHvIlOwEwlHEAAAAAA%3D%3D&amp;ref=jose.tapadas.dev" rel="noopener ugc nofollow">Google&#x2019;s dataset search</a>.</p><h1 id="1-data-exploration-and-cleanup">1. Data exploration and cleanup</h1><p>As the first step, we start by loading our dataset. After downloading the file it is trivial to open and parse it with Pandas and provide a quick list of what we could expect from it:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/0d23b5fb7550859fd4eb7ed5da2c734b" allowfullscreen frameborder="0" height="172" width="680" title="t1.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Index([&apos;room_id&apos;, &apos;survey_id&apos;, &apos;host_id&apos;, &apos;room_type&apos;, &apos;country&apos;, &apos;city&apos;, &apos;borough&apos;, &apos;neighborhood&apos;, &apos;reviews&apos;, &apos;overall_satisfaction&apos;, &apos;accommodates&apos;, &apos;bedrooms&apos;, &apos;bathrooms&apos;, &apos;price&apos;, &apos;minstay&apos;, &apos;name&apos;, &apos;last_modified&apos;, &apos;latitude&apos;, &apos;longitude&apos;, &apos;location&apos;], dtype=&apos;object&apos;)</p><p>Even though above we can confirm that the dataset was properly loaded and parsed, a quick analysis of the statistical description of the data may provide us with a quick insight of its nature:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/fb96abce656cf22635979ce398bf990f" allowfullscreen frameborder="0" height="62" width="680" title="t2.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:1000/1*0OhfqtAo2gAr70OCim3--w.png" 
class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="1000" height="204"></figure><p>From this table we can infer basic statistical observations for each of our parameters. As our model intends to predict the price, based on whatever set of inputs we&#x2019;ll provide to it, we could check for example that:</p><ul><li>the mean value of the nightly price is around <strong>88 EUR</strong></li><li>the prices range from a minimum of <strong>10 EUR</strong> to <strong>4203 EUR</strong></li><li>the standard deviation for the prices is around <strong>123 EUR</strong> (!)</li></ul><p>The price distribution could be represented as follows:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/f7ddf59491d98bb4f97f6fefa9d3c048" allowfullscreen frameborder="0" height="150" width="680" title="t3.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:388/0*OAtkBHlLPNbrNOc4.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="388" height="266"></figure><p>As we can see, our distribution of prices concentrates under the <strong>300 EUR</strong> mark, with a few entries reaching <strong>4000 EUR</strong>. 
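</p><p>A histogram like this can be sketched with pandas and matplotlib; the inline series below is a tiny illustrative stand-in for the real <code>price</code> column:</p>

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Tiny illustrative stand-in for the dataset's price column
prices = pd.Series([30, 45, 60, 75, 88, 95, 120, 150, 300, 4000], name="price")

ax = prices.plot.hist(bins=20)
ax.set_xlabel("price (EUR)")
plt.savefig("price_histogram.png")
```

<p>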
Plotting only the range where most of the prices reside:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/e0bbce7b7ccce9374024c14f5700aeff" allowfullscreen frameborder="0" height="84" width="680" title="t4.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:394/0*-kMjidTqk_TfPeYw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="394" height="266"></figure><p>We can clearly see from the representation above that most nights in Lisbon cost between <strong>0&#x2013;150 EUR</strong>.</p><p>Let us now have a sneak peek into the actual dataset, in order to understand the kind of parameters we&#x2019;ll be working with:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/523eb294aa48bf0aa489cee57f14c93b" allowfullscreen frameborder="0" height="62" width="680" title="t5.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:1000/1*sK_Ut2wu3EfDeCCgjh64Zw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="1000" height="298"></figure><p>From the description above, we should be able to infer some statistical observations about the nature of the data. 
Besides the distribution set of parameters (that we will not be looking at for now), we clearly identify several relevant insights:</p><ul><li>there are empty columns: <code>country</code>, <code>borough</code>, <code>bathrooms</code>, <code>minstay</code></li><li>entries like <code>host_id</code>, <code>survey_id</code>, <code>room_id</code>, <code>name</code>, <code>city</code> and <code>last_modified</code> may not be so relevant for our price predictor</li><li>there is some categorical data that we will not initially be able to add to the regression of the Price, such as <code>room_type</code> and <code>neighborhood</code> (but we&apos;ll be back to these two later on)</li><li><code>location</code> may be redundant for now, when we have both <code>latitude</code> and <code>longitude</code>, and we may need to further infer about the nature of the format of this field</li></ul><p>Let us then proceed to separate the dataset into:</p><ul><li>one vector <strong>Y</strong> that will contain all the real prices of the dataset</li><li>one matrix <strong>X</strong> that contains all the features that we consider relevant for our model</li></ul><p>This can be achieved by the following snippet:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/ea742e3aa4a0bb35c94e6a0e47747cbd" allowfullscreen frameborder="0" height="172" width="680" title="t6.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:517/1*EqjpaZ9D6qY9vvH3d5Qa-A.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="517" height="166"></figure><p>With our new subset, we can now try to understand the correlation of these parameters in terms of the overall satisfaction, for the most common price range:</p><figure class="kg-card kg-embed-card"><iframe 
src="https://towardsdatascience.com/media/470141fc6ed6c0e7a7fbde2603685cf5" allowfullscreen frameborder="0" height="106" width="680" title="t7.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/0*CR4yUy6V0brtrD78.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="636"></figure><p>The above plots allow us to check the distribution of all the single variables and try to infer the relationships between them. We&#x2019;ve taken the liberty of applying a color hue based on the review values for each of our chosen parameters. Some easy-to-read examples from the figure above, of relationships that may denote a positive correlation:</p><ul><li>reviews are more common for rooms that accommodate fewer guests. This could mean that most of the guests that review are renting smaller rooms.</li><li>most of the reviews are made for the cheaper priced rooms</li><li>taking into account the visual dominance of the yellow hue, most of the reviews are rated 5. Either most of the accommodations are actually very satisfactory or, more probably, the people who do review tend to give a 5 rating.</li></ul><p>One curious observation is also that the location heavily influences both price and rating. 
When plotting both longitude and latitude we obtain a quasi geographical/spatial distribution of the ratings across Lisbon:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9240414859e51074707056abacd8ce7b" allowfullscreen frameborder="0" height="62" width="680" title="t8.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:538/0*PyZJDKOLGKo8ezb3.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="538" height="424"></figure><p>We can then add this data to an actual map of Lisbon, to check the distribution:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:500/0*hl_sQAYpiOzQpBL1.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="500" height="517"></figure><p>As expected, most of the reviews are in the city center, with a relevant cluster of reviews alongside the recent Parque das Na&#xE7;&#xF5;es. In the northern, more suburban area, even though there are some scattered places, the reviews are neither as high nor as frequent as in the center.</p><h1 id="2-splitting-the-dataset">2. Splitting the dataset</h1><p>With our dataset now properly cleaned we will first proceed to split it into two pieces:</p><ul><li>a set that will be responsible for training our model, therefore called the <strong>training set</strong></li><li>a <strong>validation set</strong> that will be used to validate our model</li></ul><p>Both sets are basically subsets of <strong>X</strong> and <strong>Y</strong>, containing a subset of the rental spaces and their corresponding prices. After training our model, we use the validation set as an input to infer how well our model generalizes to data sets other than the one used to train. 
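</p><p>The split itself can be sketched with scikit-learn&#x2019;s <code>train_test_split</code>; the arrays below are illustrative stand-ins for the real <strong>X</strong> and <strong>Y</strong>, and a <code>test_size</code> of 0.25 matches the 10183/3395 split of this dataset:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the real feature matrix X and price vector Y
X = np.arange(40).reshape(20, 2)
Y = np.arange(20)

# Hold out 25% of the rows for validation; random_state makes the
# shuffle reproducible across runs
Xt, Xv, Yt, Yv = train_test_split(X, Y, test_size=0.25, random_state=42)

print(Xt.shape, Xv.shape)  # (15, 2) (5, 2)
```

<p>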
When a model is performing very well on the training set, but does not generalize well to other data, we say that the model is <strong>overfitted</strong> to the dataset.</p><p>For deeper information on overfitting, please refer to <a href="https://en.wikipedia.org/wiki/Overfitting?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://en.wikipedia.org/wiki/Overfitting</a></p><p>In order to avoid this <strong>overfitting</strong> of our model to the training data we will use a tool from sklearn called <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?ref=jose.tapadas.dev" rel="noopener ugc nofollow">train_test_split</a> that will split our data into random train and test subsets:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/4a74d9b347dd46c99eecedf689023068" allowfullscreen frameborder="0" height="194" width="680" title="t9.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Training set: Xt:(10183, 6) Yt:(10183,) <br>Validation set: Xv:(3395, 6) Yv:(3395,) <br>- <br>Full dataset: X:(13578, 6) Y:(13578,)</p><p>Now that we have our datasets in place, we can proceed to create a simple regression model that will try to predict, based on our chosen parameters, the nightly cost of an AirBnB in Lisbon.</p><h1 id="3-planting-the-decision-trees">3. Planting the Decision Trees</h1><p>As one of the most simplistic supervised ML models, a decision tree is usually used to predict an outcome by learning and inferring decision rules from all the available feature data. 
By ingesting our data parameters the trees can learn a series of educated &#x201C;questions&#x201D; in order to partition our data in a way that we can use the resulting data structure to either classify categorical data or simply create a regression model for numerical values (as it is our case with the prices).</p><p>A visualization example, taken from <a href="https://en.wikipedia.org/wiki/Decision_tree_learning?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Wikipedia</a>, could be the decision tree around the prediction for the survival of passengers in the Titanic:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:360/0*bpxepHwlE4EQaWiM.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="360" height="340"></figure><p>Based on the data, the tree is built on the root and will be (recursively) partitioned by splitting each node into two child ones. These resulting nodes will be split, based on decisions that are inferred about the statistical data we are providing to the model, until we reach a point where the data split results in the biggest information gain, meaning we can properly classify all the samples based on the classes we are iteratively creating. 
The end vertices are called &#x201C;leaves&#x201D;.</p><p>On the Wikipedia example above it is trivial to follow the decision process and, as the probability of survival is the estimated parameter here, we can easily obtain the probability that a &#x201C;male, older than 9.5 years&#x201D; survives when &#x201C;he has no siblings&#x201D;.</p><p>(For a deeper understanding of how decision trees are built for regression, I would recommend the <a href="https://statquest.org/2018/01/22/statquest-decision-trees/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Decision Trees video by StatQuest</a>.)</p><p>Let us then create our Decision Tree regressor by utilizing the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html?highlight=decisiontreeregressor&amp;ref=jose.tapadas.dev#sklearn.tree.DecisionTreeRegressor" rel="noopener ugc nofollow">sklearn implementation</a>:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/8395430d54d094863e7e7400443808c1" allowfullscreen frameborder="0" height="106" width="680" title="t10.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>DecisionTreeRegressor(criterion=&apos;mse&apos;, max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=42, splitter=&apos;best&apos;)</p><p>We can verify how the tree was built, for illustration purposes, on the picture below:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/91106e393458b5f0a34bf2010849695f" allowfullscreen frameborder="0" height="260" width="680" title="t11.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Please <a href="https://github.com/josetapadas/airbnb-lisbon-model-trees/blob/master/output_24_0.png?ref=jose.tapadas.dev" rel="noopener ugc nofollow">find here a 
graphical representation of the generated tree</a> @ Github.</p><p>We can also show a snippet of the predictions, and corresponding parameters, for a sample of the training data set. So for the following accommodations:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/42806c2aad28b910324fe82ce27cac1f" allowfullscreen frameborder="0" height="62" width="680" title="t12.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:540/1*FIBrv0qEzPKtNefEHEXPBw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="540" height="168"></figure><p>We obtain the following prices:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/55db8571380fd74e91c05d554cb8d614" allowfullscreen frameborder="0" height="62" width="680" title="t13.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([ 30., 81., 60., 30., 121.])</p><p>After fitting our model to the training data, we can now run a prediction for the validation set and measure the absolute error of our model, to assess how well it generalizes to data it was not trained on.</p><p>For this, we&#x2019;ll use the <strong>Mean Absolute Error</strong> (MAE) metric. We can consider this metric as the average error magnitude in a set of predictions. 
It can be represented as such:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:195/1*-DjPfVosxmxHUAlDE6hUOQ.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="195" height="69"></figure><p>It is basically an average over the differences between our model predictions ( <em>y-hat</em>) and the actual observations (y), considering that all individual differences have equal weight.</p><p>Let us then apply this metric to our model, using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Scikit Learn</a> implementation:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/d4f8fa3a8e38931bcb47fb8bc5a5393e" allowfullscreen frameborder="0" height="238" width="680" title="t14.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>42.91664212076583</p><p>This result basically means that our model is giving an absolute error of about <strong>42.935 EUR</strong> per accommodation when exposed to the test data, out of an <strong>88.38 EUR</strong> mean value that we collected during the initial data exploration.</p><p>Either due to our dataset being small or to our model being naive, this result is not satisfactory.</p><p>Even though this may seem worrying at this point, it is always advised to create a model that generates results as soon as possible and then start iterating on its optimization. Therefore, let us now proceed on attempting to improve our model&#x2019;s predictions a bit more.</p><p>Currently, we are indeed suffering from overfitting on the training data. 
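</p><p>To make the intuition concrete, here is a contrived pure-Python caricature of an unconstrained tree: a model that simply memorizes its training examples scores perfectly on them but falls back to a constant elsewhere (all numbers below are invented):</p><pre><code># a "maximally deep tree" caricature: memorize every training example
train = {(1, 2): 30.0, (2, 1): 81.0, (3, 2): 60.0}  # maps features to price
mean_price = sum(train.values()) / len(train)

def memorizing_predict(features):
    # perfect on training data, clueless elsewhere
    return train.get(features, mean_price)

train_mae = sum(abs(memorizing_predict(f) - y) for f, y in train.items()) / len(train)
print(train_mae)  # 0.0, which looks great...

validation = {(1, 3): 45.0, (2, 2): 95.0}
val_mae = sum(abs(memorizing_predict(f) - y) for f, y in validation.items()) / len(validation)
print(val_mae)  # ...but the error is large on unseen data</code></pre><p>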
If we imagine the decision tree that is being built, as we are not specifying a limit for the decisions to split, we will consequently generate a decision tree that grows very deep, fitting the training features too closely and not generalizing well to any unseen set.</p><p>As sklearn&#x2019;s <code>DecisionTreeRegressor</code> allows us to specify a maximum number of leaf nodes as a hyperparameter, let us quickly try to assess if there is a value that decreases our MAE:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/791ad983e10ee8c390e7c377845b1f78" allowfullscreen frameborder="0" height="678" width="680" title="t15.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>(Size: 5, MAE: 42.6016036138866) <br>(Size: 10, MAE: 40.951013502542885) <br>(Size: 20, MAE: 40.00407688450048) <br>(Size: 30, MAE: 39.6249335490541) <br>(Size: 50, MAE: 39.038730827750555) <br>(Size: 100, MAE: 37.72578309289501) <br>(Size: 250, MAE: 36.82474862034445) <br>(Size: 500, MAE: 37.58889602439078) <br>250</p><p>Let us then try to generate our model, but including the computed max tree size, and then check its predictions with the new limit:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/a2dbbe445fd31580d863a351b10c9968" allowfullscreen frameborder="0" height="238" width="680" title="t16.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>36.82474862034445</p><p>So by simply tuning our maximum number of leaf nodes hyper-parameter we could then obtain a significant improvement in our model&#x2019;s predictions. We have now reduced our model&apos;s average error by ( <code>42.935 - 36.825</code>) <strong>~ 6.11 EUR</strong>.</p><h1 id="4-categorical-data">4. 
Categorical Data</h1><p>As mentioned above, even though we have been able to keep optimizing our very simplistic model, we still dropped two possibly relevant fields that may (or may not) contribute to a better generalization and parameterization of our model: <code>room_type</code> and <code>neighborhood</code>.</p><p>These non-numerical data fields are usually referred to as <strong>Categorical Data</strong>, and most frequently we can approach them in three ways:</p><p><strong>1) Drop</strong></p><p>Sometimes the easiest way to deal with categorical data is&#x2026; to remove it from the dataset. We did this to set up our project quickly, but one must go case by case in order to infer the nature of such fields and whether it makes sense to drop them.</p><p>This was the scenario we analysed until now, with a MAE of: <strong>36.82474862034445</strong></p><p><strong>2) Label Encoding</strong></p><p>With label encoding, each value is assigned a unique integer. We can also make this transformation taking into account any kind of order/magnitude that may be relevant for the data (e.g., ratings, views, &#x2026;). 
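</p><p>Since the embedded snippets may not render in every feed reader, here is a minimal pure-Python sketch of the same idea (this mimics sklearn&#x2019;s <code>LabelEncoder</code>; the sample room types are chosen to reproduce the outputs shown below):</p><pre><code># assign each distinct category the index of its sorted position
def label_encode(values):
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[v] for v in values]

rooms = ['suite', 'suite', 'shared room', 'double room', 'single room']
classes, encoded = label_encode(rooms)
print(classes)  # ['double room', 'shared room', 'single room', 'suite']
print(encoded)  # [3, 3, 1, 0, 2]</code></pre><p>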
Let us check a simple example using the sklearn preprocessor:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/62b9898f7921a7dc4ca6f0288345310d" allowfullscreen frameborder="0" height="260" width="680" title="t17.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([3, 3, 1, 0, 2])</p><p>It is then trivial to see the transformation that the <code>LabelEncoder</code> is doing, by assigning the array index of the fitted data:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/fd2a7d16e7d0bdcd46f3b0370ba4fe16" allowfullscreen frameborder="0" height="62" width="680" title="t18.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([&apos;double room&apos;, &apos;shared room&apos;, &apos;single room&apos;, &apos;suite&apos;], dtype=&apos;&lt;U11&apos;)</p><p>Let us then apply this preprocessing technique to our categorical data and verify how it affects our model predictions. 
So our new data set would be:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/b4a1bcfc0de4077c4ecaaf9ffb415457" allowfullscreen frameborder="0" height="128" width="680" title="t19.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*qq6HfSEgbEeKAYqvTv1Vbw.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="164"></figure><p>Our categorical data, represented in our pandas DataFrame as an <code>object</code>, can then be extracted by:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/344c1143394b96934b844551dd9f9fc6" allowfullscreen frameborder="0" height="128" width="680" title="t20.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>[&apos;room_type&apos;, &apos;neighborhood&apos;]</p><p>Now that we have the columns, let us then transform them in both the training and validation sets:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/bb59d7443ac9037e253855711063ac3a" allowfullscreen frameborder="0" height="370" width="680" title="t21.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*A76HrEgeWUUtNCURXJh9oA.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="168"></figure><p>Let us now train and fit the model with the transformed data:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/979aa7c706878c2004e55e791b298250" allowfullscreen frameborder="0" height="304" width="680" title="t23.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>35.690195084932355</p><p>We have then improved our predictor, by encoding our categorical data, reducing 
our MAE to <strong>~ 35.69 EUR</strong>.</p><p><strong>3) One-Hot Encoding</strong></p><p>One-Hot encoding, instead of enumerating a field&#x2019;s possible values, creates new columns indicating the presence or absence of the encoded values. Let us showcase this with a small example:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/f102fc1c018f1204aba7db6b22d19916" allowfullscreen frameborder="0" height="282" width="680" title="t24.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>array([[0., 0., 0., 1., 0., 0., 1.], [0., 1., 0., 0., 0., 1., 0.]])</p><p>From the result above we can see that the binary encoding places a <code>1</code> on the features that each feature array actually has, and a <code>0</code> when a feature is not present. Let us then try to use this preprocessing on our model:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9e684b1a800cf1b4a6daad843958e402" allowfullscreen frameborder="0" height="194" width="680" title="t25.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:628/1*_ek4d5Ox3NEDlfOJ1WXSog.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="628" height="198"></figure><p>So the above result may look weird at first but, for the 26 possible categories, we now have a binary column flagging the presence of each one. 
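</p><p>As before, in case the embedded snippet does not render, here is a minimal pure-Python sketch of the transformation for a single column (toy values; sklearn&#x2019;s <code>OneHotEncoder</code> additionally handles several columns and unknown categories):</p><pre><code># one 0/1 column per category, mimicking one-hot encoding of a single field
def one_hot(values):
    classes = sorted(set(values))
    return classes, [[1.0 if v == c else 0.0 for c in classes] for v in values]

classes, rows = one_hot(['suite', 'double room', 'suite'])
print(classes)  # ['double room', 'suite']
print(rows)  # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]</code></pre><p>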
We will now:</p><ul><li>add back the original row indexes that were lost during the transformation</li><li>drop the original categorical columns from the original sets <code>train_X</code> and <code>validation_X</code></li><li>replace the dropped columns by our new dataframe with all 26 possible categories</li></ul><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/2b440db9452cdbfbfe837fdffbd03bcd" allowfullscreen frameborder="0" height="282" width="680" title="t26.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*k9d83hBnIs5UU3HNZMHRaA.png" class="kg-image" alt="Predicting AirBnB prices in Lisbon: Trees and Random Forests" loading="lazy" width="700" height="141"></figure><p>Now we can use our new encoded sets in our model:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/a220db2a01c272b60e448c0aba42dcac" allowfullscreen frameborder="0" height="304" width="680" title="t27.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>36.97010930367817</p><p>By using One-Hot Encoding on our categorical data we obtain a MAE of ~ <strong>36.97 EUR</strong>.</p><p>This result suggests that One-Hot Encoding, at least when applied to both of our categorical parameters at the same time, is not as good a fit here as Label Encoding. Nevertheless, it still allowed us to include the categorical parameters while reducing the initial MAE.</p><h1 id="5-random-forests">5. 
Random Forests</h1><p>From the previous section we could see that, with our Decision Tree, we are always balancing between:</p><ul><li>a deep tree with many leaves, in our case with few AirBnB places on each of them, being then too overfitted to our training set (they present what we call <strong>high variance</strong>)</li><li>a shallow tree with few leaves that is unable to distinguish between the various features of an item</li></ul><p>We can imagine a &#x201C;Random Forest&#x201D; as an ensemble of Decision Trees that, in order to try to reduce the variance mentioned above, generates its trees with added randomness so that combining them reduces the overall error. Some examples of how the random forest is created could be:</p><ul><li>generating trees with different subsets of the parameters. For example, from our set of parameters analysed above, trees would be generated having only a random set of them (e.g., a Decision Tree with only &#x201C;reviews&#x201D; and &#x201C;bedrooms&#x201D;, another with all parameters except &#x201C;latitude&#x201D;)</li><li>generating other trees by training on different samples of data (different sizes, different splits of the data set into training and validation, &#x2026;)</li></ul><p>In order to reduce the variance, the added randomness makes the generated individual trees&#x2019; errors less likely to be related. 
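</p><p>A toy simulation makes this variance argument concrete. The sketch below is purely illustrative: the noise level and the number of trees are invented, and 88.38 EUR is just our dataset&#x2019;s mean price used as a stand-in for a true value:</p><pre><code>import random
import statistics

random.seed(42)

# each "tree" makes an unbiased prediction with independent noise
def tree_predict(true_price):
    return true_price + random.gauss(0, 20)

# a "forest" averages many such noisy predictions
def forest_predict(true_price, n_trees=25):
    preds = [tree_predict(true_price) for _ in range(n_trees)]
    return sum(preds) / n_trees

single_errors = [tree_predict(88.38) - 88.38 for _ in range(2000)]
forest_errors = [forest_predict(88.38) - 88.38 for _ in range(2000)]

print(statistics.pstdev(single_errors))  # about 20
print(statistics.pstdev(forest_errors))  # about 4, i.e. 20 over sqrt(25)</code></pre><p>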
The final prediction is then taken as the average of the individual decision trees&#x2019; predictions; combining them has the interesting effect of canceling some of those errors out, further reducing the variance of the whole prediction.</p><p>The original publication, explaining this algorithm in more depth, can be found in the further reading section at the end of this article.</p><p>Let us then implement our predictor using a Random Forest:</p><figure class="kg-card kg-embed-card"><iframe src="https://towardsdatascience.com/media/9aa512fdc10a5a82bd624a5b4585d921" allowfullscreen frameborder="0" height="524" width="680" title="t28.py" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>33.9996500736377</p><p>We can see that we have a significant reduction in our MAE when using a Random Forest.</p><h1 id="6-summary">6. Summary</h1><p>Even though decision trees are a very simplistic (maybe the simplest) regression technique in machine learning that we can use in our models, we aimed to demonstrate a sample process of analysing a dataset in order to generate predictions. It was clear that with small optimization steps (like cleaning up the data and encoding categorical data) and generalizing from a single tree to a random forest we could significantly reduce the mean absolute error of our model predictions.</p><p>We hope that this example becomes useful as a hands-on experience with machine learning, and please don&#x2019;t hesitate to contact me if I can clarify or correct some of what was demonstrated above. The plan is also to proceed on further optimizing our predictions on this specific dataset in future articles, using other approaches and tools, so stay tuned :)</p><h1 id="7-further-reading">7. 
Further reading</h1><p>Please find below some resources that are very useful for understanding some of the concepts presented:</p><ul><li>StatQuest, Decision Trees: <a href="https://statquest.org/2018/01/22/statquest-decision-trees/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://statquest.org/2018/01/22/statquest-decision-trees/</a></li><li>Bias&#x2013;variance tradeoff: <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff</a></li><li>Breiman, Random Forests, Machine Learning, 45(1), 5&#x2013;32, 2001: <a href="https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Setting up a simple Rails development environment with Docker for fun and profit]]></title><description><![CDATA[<p>Creating a development environment may seem like a trivial task for many developers. 
As time progresses, and we find ourselves dwelling through the life cycle of so many projects, one probably ends up with a fragile and cluttered development machine, filled with an entropic set of unmanageable services and library</p>]]></description><link>https://jose.tapadas.dev/setting-up-a-simple-rails-development-environment-with-docker-for-fun-and-profit/</link><guid isPermaLink="false">65d7675cc7a18807246ff004</guid><dc:creator><![CDATA[José Tapadas Alves]]></dc:creator><pubDate>Wed, 25 Jan 2017 00:00:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1523427373578-fa4bbfc4389a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fHJhaWxzfGVufDB8fHx8MTcwODYxNTUzMnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1523427373578-fa4bbfc4389a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fHJhaWxzfGVufDB8fHx8MTcwODYxNTUzMnww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Setting up a simple Rails development environment with Docker for fun and profit"><p>Creating a development environment may seem like a trivial task for many developers. 
As time progresses, and we find ourselves dwelling through the life cycle of so many projects, one probably ends up with a fragile and cluttered development machine, filled with an entropic set of unmanageable services and library versions, ultimately getting to a point where things simply start to crack without any apparent reason.</p><p>With this small guide I hope to equip you with the set of tools and gears to create simple, manageable and isolated <em>production-like</em> development environments using Docker containers.</p><h1 id="the-plan">The Plan</h1><p>For this specific example we will create a fully contained <a href="http://rubyonrails.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Ruby on Rails</a> development environment alongside the isolated common services it usually communicates with, namely: a <a href="https://www.postgresql.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">PostgreSQL</a> database, <a href="http://sidekiq.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Sidekiq</a> (and <a href="https://redis.io/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Redis</a> to support it).</p><p>For creating and managing the isolated Linux containers we will use a combination of:</p><ul><li><a href="https://www.docker.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker</a>: the tool that will allow us to run lightweight isolated containers for our app</li><li><a href="https://www.docker.com/products/docker-compose?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Compose</a>: a tool that will help us manage multiple containers for the multiple services</li></ul><p>As this guide presents itself with a more simplistic practical approach, and does not go into the intricacies of Docker itself, please refer to the <a href="https://docs.docker.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker official documentation guides</a> if you feel the need to get a grasp of more specific information on any of 
the tools presented above.</p><h1 id="entering-docker">Entering Docker</h1><p>We will then start by creating a container to run our code using Docker.</p><p>Even though the concept may seem analogous to a Virtual Machine (VM), a container does not fully virtualize the whole hardware and OS stack as a standard VM does. The container will include the application and all of its dependencies, running its processes in an isolated userspace, but sharing the host kernel with other containers. These containers can be looked upon as lightweight VMs (<a href="https://blog.docker.com/2016/03/containers-are-not-vms/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">although they really aren&#x2019;t</a>) that provide a full virtual environment without the overhead that comes with booting up a separate kernel and simulating all the hardware.</p><p>To achieve this, Docker relies on both the Linux kernel and the <a href="https://en.wikipedia.org/wiki/LXC?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Linux Containers</a> (LXC) infrastructure.</p><h2 id="supervising-the-supervisor">Supervising the Supervisor</h2><p>Not all operating systems actually support isolated userspaces. Linux supports them, but OSX and Windows don&#x2019;t. So if you&#x2019;re on OSX or Windows, we will have to use a virtualization solution after all in order to boot a kernel that does support them.</p><p>Historically this was achieved by spinning a VM on VirtualBox with a <a href="http://boot2docker.io/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">tiny Linux distribution</a> on it to host the containers. 
Since <a href="https://blog.docker.com/2016/06/docker-mac-windows-public-beta/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">last June,</a> the Docker Team dropped VirtualBox leveraging both <a href="https://github.com/docker/HyperKit/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">HyperKit</a>, a lightweight macOS virtualization solution built on top of the native <a href="https://developer.apple.com/reference/hypervisor?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Hypervisor.framework</a> (introduced on macOS 10.10), and for Windows it now uses the <a href="https://technet.microsoft.com/en-us/library/mt169373(v=ws.11).aspx?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Microsoft Hyper-V</a> solution.</p><p>For this particular example we will be working on a macOS machine so let us start by downloading:</p><ul><li><a href="https://www.docker.com/products/docker?ref=jose.tapadas.dev#/mac" rel="noopener ugc nofollow">Docker for Mac</a> : <em>Docker for Mac is a native Mac application architected from scratch, with a native user interface and auto-update capability, deeply integrated with OS X native virtualization, Hypervisor Framework.</em></li></ul><h2 id="creating-a-new-image">Creating a new Image</h2><p>The way Docker images are managed is a bit like projects are managed via git on GitHub. There is a public collection of open source images at the <a href="https://hub.docker.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Hub</a> where we can <code>docker pull</code> existent images or <code>docker push </code>our contributions and custom configurations.</p><p>After having Docker up and running on our machine we can now start writing the recipe that will instruct it to build our image and the contained environment we intend to be working with. 
This is achieved by creating a <code>Dockerfile</code> on the root of our project:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/4e398db14d551fde11940bc9cac0bf63" allowfullscreen frameborder="0" height="546" width="680" title="Sample Dockerfile" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Let us then go through this simple configuration. The first line simply states that we will base this environment on a lightweight Ruby image from the <a href="https://hub.docker.com/_/ruby/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">official Ruby repository</a> named <code>ruby:2.3-slim</code> (you can actually check its own <a href="https://github.com/docker-library/ruby/blob/6e2934a351adb67fd95aed1a669ad54d758834a0/2.3/slim/Dockerfile?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><code>Dockerfile</code></a>):</p><pre><code>FROM ruby:2.3-slim</code></pre><p>Next we have a run list to install all the dependencies we need to have a basic development environment. 
You need this because the <code>ruby:2.3-slim</code> image is very minimal, and doesn&#x2019;t contain them out of the box:</p><pre><code>RUN apt-get update &amp;&amp; apt-get install -qq -y --no-install-recommends build-essential nodejs libpq-dev git tzdata libxml2-dev libxslt-dev ssh &amp;&amp; rm -rf /var/lib/apt/lists/*</code></pre><ul><li>the <a href="http://packages.ubuntu.com/precise/build-essential?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><em>build-essential</em></a> package to have the GNU C compilers, GNU C Library, Make and standard Debian package building tools</li><li><em>nodejs</em> as our choice for a JavaScript runtime for the asset pipeline</li><li><em>libpq-dev</em> is the programmer&#x2019;s interface to PostgreSQL</li><li><em>tzdata</em> as a dependency for the <a href="https://github.com/tzinfo/tzinfo?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Ruby Timezone Library</a></li><li><a href="http://xmlsoft.org/?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><em>libxml2-dev</em></a> and <a href="http://xmlsoft.org/libxslt/?ref=jose.tapadas.dev" rel="noopener ugc nofollow"><em>libxslt-dev</em></a> to build <a href="https://github.com/sparklemotion/nokogiri?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Nokogiri</a></li><li><em>ssh</em> and <em>git</em> as two essential tools for any sane developer</li></ul><p>Busting the apt cache and removing the contents of <code>/var/lib/apt/lists</code> helps us keep the image size down.</p><p>The next block sets our working directory in an environment variable and creates the folder that will accommodate our Rails app. 
The <code>WORKDIR</code> instruction basically sets the working directory for any <code>RUN</code>, <code>CMD</code>, <code>ENTRYPOINT</code>, <code>COPY</code> and <code>ADD</code> instructions that follow it in the <code>Dockerfile</code>.</p><pre><code>ENV APP_HOME /opt/fooapp
RUN mkdir -p $APP_HOME
WORKDIR $APP_HOME</code></pre><p>Finally, as we will be vendoring our gems with Bundler on our vendor path <code>/opt/fooapp/vendor/bundle</code>, we&#x2019;ll finish by setting the required environment variables:</p><pre><code>ENV GEM_HOME /opt/fooapp/vendor/bundle
ENV PATH $GEM_HOME/bin:$PATH
ENV BUNDLE_PATH $GEM_HOME
ENV BUNDLE_BIN $BUNDLE_PATH/bin</code></pre><p>Now that our recipe is complete, it is time to test building the image. This can be done by simply running:</p><pre><code>$ docker build -t samplefooappimage .</code></pre><p>Afterwards, by listing all the available images, we can confirm that it was in fact created:</p><pre><code>$ docker images
REPOSITORY        TAG       IMAGE ID       CREATED         SIZE
samplefooappimage latest    8915a11cb4c6   12 minutes ago  484.7 MB
ruby              2.3-slim  68e02bf2b853   7 days ago      273.8 MB</code></pre><h2 id="composing-the-services-stack">Composing the Services Stack</h2><p>Now that we have properly built our base image it is time to assemble and configure the services set that our application will be using. For this purpose we will be using <a href="https://docs.docker.com/compose/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Compose</a>, included on the installed Docker <em>toolbelt</em>. This tool will basically enable us to create and manage a multi-service, multi-container docker application.</p><p>We will start by creating an initial version of the configuration file with both our Rails app and PostgreSQL. 
On the root of our app, we edit a file named <code>docker-compose.yml</code>:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/deda5d07926812c0c580a4f522269aa0" allowfullscreen frameborder="0" height="458" width="680" title="Rails and PostgreSQL simple docker-compose.yml" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>The configuration file is pretty straightforward. We are setting up two service blocks. We will call the configuration that creates and runs PostgreSQL <code>database</code>, and our Rails app <code>web</code>. Please notice that Compose here will then create two separate Docker containers that will eventually communicate with each other, mimicking much more closely a real setup.</p><p>Let us now look a bit closer at each of the services&#x2019; configuration. For the <code>database</code> block:</p><ul><li>the <code>image</code> keyword simply specifies the image that will be used to start building the container. In this specific case we will be using the <a href="https://hub.docker.com/_/postgres/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">official PostgreSQL image</a>.</li><li>due to the volatile nature of Docker containers, we will need to persist the database data on our filesystem. To achieve that we are then specifying a mount point, from the host machine to the container, using the <code>volumes</code> keyword.</li><li>specify a filename containing a list of environment variables to be exported upon creation. For this example, the <code>env_file</code> will also be simple and contain the required PostgreSQL credentials info. 
We will then create it with some sample data:</li></ul><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/72ee5d3705230b6acdf61b6622b9ad97" allowfullscreen frameborder="0" height="84" width="680" title="Sample .env file" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>The Rails <code>web</code> service configuration is also pretty straightforward:</p><ul><li>create a link with the service container named <code>database</code>. Using this <code>links</code> keyword will also make the linked service reachable using a hostname identical to the service name, in this example: <code>database</code></li><li>specify the Dockerfile path to <code>build</code> the image</li><li>specify a mounting point for synchronising our app&#x2019;s code between our host machine and the container</li><li>expose, and map, the <code>3000</code> port on both the host and the container</li><li>specify the default <code>command</code> to run after the container is started</li><li>we will reuse the same <code>.env</code> file in order to have, on this container, the same configuration variables to simplify the database configuration</li></ul><p>Before we can start our two services we have to initialise the Rails app by installing the gems on the container and configuring the database. 
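</p><p>In case the embedded file does not render in your reader, the two service blocks described above amount to roughly the following <code>docker-compose.yml</code> (a sketch; the image tag, mount paths and compose file version are illustrative):</p><pre><code>version: '2'

services:
  database:
    image: postgres
    volumes:
      - ./tmp/db:/var/lib/postgresql/data
    env_file: .env

  web:
    build: .
    links:
      - database
    volumes:
      - .:/opt/fooapp
    ports:
      - "3000:3000"
    command: bundle exec puma
    env_file: .env</code></pre><p>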
This will be the first command we will run directly inside the <code>web</code> service container:</p><pre><code>$ docker-compose run --rm web bundle install</code></pre><p>After installing the dependencies we should update the application&#x2019;s <code>config/database.yml</code> with the database configuration information we&#x2019;ve created on the <code>docker-compose.yml</code> file:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/130c7bba0aa83b2d202e4d018ce3b8f5" allowfullscreen frameborder="0" height="282" width="680" title="Sample dockerized database.yml" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>On this plain configuration file I would like you to notice two things:</p><ul><li>the host is using the network alias we&#x2019;ve specified on the compose link configuration</li><li>we are using the environment variables exported from our <code>.env</code> file</li></ul><p>Now it is just a matter of creating the database inside the container:</p><pre><code>$ docker-compose run --rm web bundle exec rake db:create</code></pre><p>And start both services (the optional <code>-d</code> flag runs the containers in detached mode):</p><pre><code>$ docker-compose up -d</code></pre><p>You can check the state of your containers by running <code>docker-compose ps</code>, and eventually stop them by running <code>docker-compose stop</code>:</p><pre><code>$ docker-compose ps
Name                Command            State  Ports
------------------------------------------------------------------
fooapp_database_1   postgres           Up     5432/tcp
fooapp_web_1        bundle exec puma   Up     0.0.0.0:3000-&gt;3000/tcp</code></pre><p>If you are using OSX or Windows, the containers can also be managed by a tool called <a 
href="https://kitematic.com/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Kitematic</a> (part of the Docker Toolbelt) which, if it is installed, will also be available under the Docker icon on the macOS&#x2019; menu bar:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:700/1*d4DtkTl-t2cass7A64W5Kg.png" class="kg-image" alt="Setting up a simple Rails development environment with Docker for fun and profit" loading="lazy" width="700" height="444"><figcaption><span style="white-space: pre-wrap;">Kitematic showing both running containers</span></figcaption></figure><p>As puma is bound to the default interface you can now easily access your app, outside the container, by navigating to <a href="http://0.0.0.0:3000/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">http://127.0.0.1:3000</a> as expected.</p><h2 id="adding-sidekiq-to-the-mix">Adding Sidekiq to the Mix</h2><p>Let us now build up on this preliminary stack by spinning a new container with our codebase but for running Sidekiq. 
To add new services we simply re-use the same simple configurations in our composition file.</p><p>As Sidekiq requires <em>Redis</em> to work, we add it by editing our <code>docker-compose.yml</code> file:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/c7ce561683e9adfcfb846c9aa32f7627" allowfullscreen frameborder="0" height="634" width="680" title="Adding redis to the composition" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>As you can see, it is pretty straightforward:</p><ul><li>we use an existing official <em>Redis</em> image</li><li>we specify the port forwarding</li><li>we specify a mount point to persist the data on the host machine</li><li>and we also link this new service to our Rails app, just in case</li></ul><p>After running <code>docker-compose up</code> you will see the new container being built and started.</p><p>Finally, we simply add a new container specifically for <em>Sidekiq</em>:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/f232ac7006b7f7bfc238c5e8a2a3ac4b" allowfullscreen frameborder="0" height="854" width="680" title="docker-compose file with Sidekiq" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>As you can see from the updated composition file, we now have a <code>sidekiq</code> service, linked to both our database and redis (with a host alias to avoid confusion). 
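</p><p>Schematically, and again only as a hedged sketch rather than the literal gist contents, the two new services could look something like this (the <code>redis</code> alias, image and paths are assumptions):</p><pre><code>  redis:
    # official Redis image; the official image keeps its data under /data
    image: redis
    ports:
      - &quot;6379:6379&quot;
    volumes:
      - ./tmp/redis:/data

  sidekiq:
    # same codebase as the web service, but running the Sidekiq process
    build: .
    command: bundle exec sidekiq
    env_file: .env
    volumes:
      - .:/opt/fooapp
    links:
      - database
      - redis:redis
</code></pre><p>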
We are also sharing our <code>.env</code> file, which we need to update to reference our <code>REDIS_URL</code> so Sidekiq knows how to connect to it:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/ebc68a4d1d3703af3b5d06bd90a1fe73" allowfullscreen frameborder="0" height="106" width="680" title="Environment files with redis now" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Then we simply install the new app dependencies directly in the container, and restart our services:</p><pre><code>$ docker-compose run --rm sidekiq bundle install
(...)
$ docker-compose restart</code></pre><p>We will now see all services, including the new Sidekiq one, up and running:</p><figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*67Oh73QZINlrDIg1dj3s-Q.png" class="kg-image" alt="Setting up a simple Rails development environment with Docker for fun and profit" loading="lazy" width="700" height="310"></figure><h2 id="debugging-tip-with-pry">Debugging tip with pry</h2><p>To be able to use a tool like pry on this stack we need a way to attach to a running container and write to its stdin. 
To do so we add the <code>tty</code> and <code>stdin_open</code> options to the compose configuration accordingly:</p><figure class="kg-card kg-embed-card"><iframe src="https://revs.runtime-revolution.com/media/7f0eca3902937595982ba82837cc37dd" allowfullscreen frameborder="0" height="942" width="680" title="tty open" class="eo n ff dy bg" scrolling="no"></iframe></figure><p>Enabling these will allow us to attach to any running container and interactively engage with its current running process, in this scenario with a tool like pry.</p><p>After adding a simple <code>binding.pry</code> breakpoint to our app, we can reach it by first identifying the ID of the running container we want to attach to:</p><pre><code>$ docker ps
CONTAINER ID   IMAGE        COMMAND              CREATED   STATUS   PORTS                    NAMES
(...)
7fba610d5c62   fooapp_web   &quot;bundle exec puma&quot;                      0.0.0.0:3000-&gt;3000/tcp   fooapp_web_1
(...)</code></pre><p>And then attaching to it:</p><pre><code>$ docker attach 7fba610d5c62

From: /opt/fooapp/app/views/static_pages/root.html.erb @ line 5 ActionView::CompiledTemplates#_app_views_static_pages_root_html_erb___2205918372626965601_36280540:

    1: A new website.
    2:
    3: This is not a new feature :)
    4:
 =&gt; 5: &lt;% binding.pry %&gt;

[1] pry(#&lt;#&lt;Class:0x00000004531cf0&gt;&gt;)&gt;</code></pre><p>From here you can interact with the breakpoint as you would in your usual development workflow.</p><h1 id="wrap-up">Wrap up</h1><p>Even though this is a very minimalistic setup, it comprises a contained, multi-service application foundation that can serve as the basis for many of the projects you may end up working on. 
This setup also has the bonus of actually mimicking the machine structure of a multi-service application.</p><p>The simple approach of using these configuration files will not only keep your development workflow sane if you are working on multiple projects with a diverse ecosystem of dependencies, but will also dramatically ease the learning curve for newcomers to virtually any project you have set up.</p><p>Please feel free to reference and contribute to the <a href="https://gist.github.com/josetapadas/044fda80719f218c1a40a0b7492bdd3e?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Dockerfile</a> and <a href="https://gist.github.com/josetapadas/a94dc057f26e442c9d17f20721b09dbd?ref=jose.tapadas.dev" rel="noopener ugc nofollow">Docker Compose</a> configuration files used in this example.</p><h2 id="further-reading">Further reading</h2><ul><li><a href="https://docs.docker.com/engine/understanding-docker/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://docs.docker.com/engine/understanding-docker/</a>: <em>Understanding Docker</em>, from the official documentation</li><li><a href="https://docs.docker.com/compose/reference/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">https://docs.docker.com/compose/reference/</a>: Docker Compose command reference</li><li><a href="http://docker-sync.io/?ref=jose.tapadas.dev" rel="noopener ugc nofollow">http://docker-sync.io/</a>: gives macOS users a performance boost by using either <strong>rsync</strong> or <strong>unison</strong> to sync volumes between the host machine and the containers</li></ul>]]></content:encoded></item></channel></rss>