<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://watacoso.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://watacoso.github.io/" rel="alternate" type="text/html" /><updated>2025-07-04T17:11:34+00:00</updated><id>https://watacoso.github.io/feed.xml</id><title type="html">Watacoso’s Blog</title><subtitle>I work on data. I experiment a bit. Games are nice too.</subtitle><author><name>Guido Pintus</name></author><entry><title type="html">Data from scratch(1): Building a Barebones Data Ingestion Package</title><link href="https://watacoso.github.io/2025/07/04/Data-from-scratch(1)-building-a-barebones-data-ingestion-package.html" rel="alternate" type="text/html" title="Data from scratch(1): Building a Barebones Data Ingestion Package" /><published>2025-07-04T00:00:00+00:00</published><updated>2025-07-04T00:00:00+00:00</updated><id>https://watacoso.github.io/2025/07/04/Data-from-scratch(1):-building-a-barebones-data-ingestion-package</id><content type="html" xml:base="https://watacoso.github.io/2025/07/04/Data-from-scratch(1)-building-a-barebones-data-ingestion-package.html"><![CDATA[<p>I am still setting up this blog, so I will be writing about some of the things I am learning as I go along.</p>

<hr />

<h3 id="do-we-need-another-data-ingestion-package">Do we need another data ingestion package?</h3>

<p>Apart from claiming yet another name on PyPI, I don’t think <em>miniduct</em> is going to accomplish much. I am sure there are countless packages that do exactly what miniduct does, surely better and more elegantly. In fact, I could just run an instance of Airflow or Dagster and be done with it, having at my disposal a plethora of features that any data enthusiast in a corporate environment loves to have, like:</p>

<ul>
  <li>Scheduling</li>
  <li>Monitoring</li>
  <li>Logging</li>
  <li>Error handling</li>
  <li>Backfilling</li>
  <li>Parallel execution</li>
</ul>

<p>The list is huge. So huge, in fact, that it can be quite overwhelming for a new data engineer. Why do we need all these features anyway? What pushed engineers to build such complex systems?</p>

<h4 id="starting-simple">Starting simple</h4>

<p>Engineers build on top of the work of others, and we compose systems that solve very tangible problems that a lot of people had in their careers.</p>

<p>I have ‘Designing Data-Intensive Applications’ by Martin Kleppmann in my library, and I often go back to it. Whenever I learn about some old or new technology, I can find the roots of its implementation in the concepts described in the book. I love how the book starts from simple systems, like a data store built on a few bash commands, and then explains which additional requirements shape the myriad of data system implementations we have today.</p>

<p>So miniduct, and other packages that I will be building, will start from humble origins, and will expand o</p>

<h3 id="what-is-miniduct">What is miniduct?</h3>

<p>Let’s start from this problem:</p>

<pre><code>We need a way to ingest data from an API, and save it to a folder in the host filesystem.
</code></pre>

<p>In Python, we can do this with a few lines of code:</p>

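<pre><code class="language-python">import os

import requests


def fetch_data(api_url, save_path):
    """Download the response body from api_url and write it to save_path."""
    # A timeout keeps the call from hanging forever on a stalled connection
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # Raise an HTTPError on 4xx/5xx responses
    # Create the destination folder if it does not exist yet
    os.makedirs(os.path.dirname(save_path) or '.', exist_ok=True)
    with open(save_path, 'w', encoding='utf-8') as file:
        file.write(response.text)
</code></pre>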

<p>Now, every time another developer needs data from the API, they can just call this function, and the data will be saved to the specified path.</p>

<pre><code class="language-python">fetch_data('https://api.cats.com/names', '/workdir/cats/names.txt')
</code></pre>]]></content><author><name>Guido Pintus</name></author><category term="Other" /><summary type="html"><![CDATA[I am still setting up this blog, so I will be writing about some of the things I am learning as I go along.]]></summary></entry></feed>