The problem with most small businesses is that they can’t afford to hire a data team or set up a proper data warehouse. And yet even the smallest of Internet businesses can have fairly complex data needs.

Let’s say you’re running a simple newsletter business. You use Google Analytics to track anonymized user clicks (and you can’t tie those clicks to your members). Your website stores member information and knows when each member last logged in, but not what they read, so you have no idea which articles are popular amongst members. Your email sending software tracks your (free) newsletter subscribers, emails sent, and open rates, but not your subscriber-to-member conversion rate.

The clear answer here is to put all of these data sources in one place. Otherwise you won’t know what options you have to grow the business: you have all these data points, but you can’t come up with an end-to-end explanation of how your business works.

The common solution is to put all the data into a data warehouse. This is great, except … a full-blown data warehouse is a huge time and cost commitment. What would be ideal is a minimum viable data warehouse: something that simply combines data from multiple vendors in a single database.

This blog post covers some of the lessons I learned from building and operating such a data warehouse.

The Stack

Other than the core storage layer, our data warehouse will need custom scripts to get data from multiple vendors, a scheduler to regularly run these scripts, and a visualization tool to make the data easily accessible for business stakeholders.

Since my needs are simple - collect all the data in one place - I started with PostgreSQL as the database, custom Go scripts, a cron job to run those scripts periodically, and Metabase Open Source as the visualization tool of choice. This stack fits on a small ($6/month) Virtual Private Server (a rented server) and can reliably handle our small data size.

The components can be easily changed. You can swap the visualization tool from Metabase to Rill, or use another database like MySQL or DuckDB for the core storage layer. Since I am familiar with Go, I use it to write my scripts, but you can use Python too.

With Postgres and Metabase set up, let’s now turn our attention to the scripts. These scripts collect data from multiple vendors, clean it, and store it in the database.

Making cron jobs production-ready

To run the scripts regularly, I use cron. Cron comes preinstalled on most servers and is a simple tool for running scripts on a schedule, at whatever frequency you like: every 5 minutes, or every day at exactly 8:13 in the morning. But cron jobs can fail for multiple reasons: your script can’t reach the vendor, the vendor is down, your machine runs out of memory, etc. As I’ve worked with my Go scripts, I’ve learned a few things to make them run more reliably.
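As a reference, a schedule like “every day at 8:13 in the morning” looks like this in a crontab entry (the script path and log file here are placeholders for your own):

```
# m  h  dom mon dow  command
13   8  *   *   *    /home/me/bin/sync-data >> /home/me/sync-data.log 2>&1
```

Redirecting stdout and stderr to a log file gives you something to inspect when a run misbehaves.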


Idempotency is the property of a script that its final result is the same regardless of how many times you run it. A script that reads a number from a file, divides it by 1, and writes the result back to the file is idempotent. A similar script that adds 2 to the number is not.
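A minimal sketch of the distinction (function names are mine, for illustration):

```go
package main

import "fmt"

// divideByOne is idempotent: running it once or a hundred times
// leaves the value unchanged, so the final result is always the same.
func divideByOne(n int) int { return n / 1 }

// addTwo is not idempotent: every extra run moves the result further.
func addTwo(n int) int { return n + 2 }

func main() {
	v := 10
	fmt.Println(divideByOne(divideByOne(v))) // same as a single run: 10
	fmt.Println(addTwo(addTwo(v)))           // 14, but a single run gives 12
}
```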

Designing your script to be idempotent gives you peace of mind. You can run it multiple times without fearing that the next run will fail because you already have the data. If your script hits a transient network issue, you can simply retry and it will backfill the data.

To achieve idempotency, I use the INSERT ... ON CONFLICT DO NOTHING SQL statement. It inserts a row into a table only if the row does not already exist; if it does, it does nothing. This is very useful because now you can run the script as many times as you like and it will only insert new data.
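For example, with a hypothetical members table keyed by member id, a re-run that tries to insert the same row is silently skipped:

```sql
-- Second and later runs hit the conflict on id and insert nothing.
INSERT INTO members (id, email, last_login)
VALUES (42, 'member@example.com', '2024-01-01')
ON CONFLICT (id) DO NOTHING;
```

The conflict target (here the primary key `id`) decides what “already exists” means, so pick it to match your data’s natural key.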


As cron jobs can fail, I need a way to be notified when a script runs successfully, and when it fails. I send a status message to myself (via Telegram) every time the script runs. I also report the error encountered by a failing script so I can quickly determine whether I need to fix the script or simply wait for the next run (say, if it failed due to a transient network issue).

After every run, the script reports its status by making a POST request to my Telegram bot, which then sends me a message.

2 Modes of Filling Data

When designing my scripts, I use two modes of filling data: batch and incremental. Batch mode backfills the data from the beginning; I run it once when I first bootstrap my data warehouse. For the daily cron job, I run the script in incremental mode.

My script uses a --since flag to switch between these two modes. If the flag is not specified, the script grabs all the data for the past week and attempts to fill it into the database. Since the script is idempotent, permanent data loss is only a risk if the script fails for 7 consecutive days. That has a very low probability, and even then I can always run the script once in batch mode to refill the missing data.


A variant of idempotency is making sure that the results are consistent regardless of when the script is run. Some vendors expose metrics in the local timezone; others use UTC. So you might get different results depending on what time of day the script runs, which leads to inconsistent and confusing data.

When developing the scripts, I need to make sure each one produces consistent results regardless of when it runs. This often means only populating data up to yesterday, trading a little data freshness for idempotency.


So far I have focused mostly on the scripts. That’s because I didn’t do much beyond standard installation for the other parts of the stack, i.e. Postgres, Metabase, and cron. This might feel like a scrappy setup, but starting simple means lower initial cost, lower maintenance cost, and shorter time-to-value. All of these are crucial when you’re starting out.

That’s all you need to build a minimum viable data warehouse. Now that it’s up, it’s time to analyze your data and find ways to grow your business.