Analytics Data on SQL Database – Best database and table design for billions of rows of data

This is not so much an article I am writing as a great gem of a Stack Overflow answer that I am mostly quoting, one I came across while researching a DIY way to store analytics data and build reports for a small-to-medium-sized project. The project's type doesn't matter, because this is a generic problem with a great solution.

Why not use analytics tools/services?

I am constantly searching for better alternatives to, or simpler versions of, the solutions we use on my team. We certainly use many services and tools, from open source to licensed software. But I still choose to understand, know, and be able to apply these solutions myself in a custom setup where I have full control over the data, the output, and the user experience.

So I casually read and research how others approach the issues and questions that wander through my mind.

Then I stumbled upon this Stack Overflow thread with a brilliant answer that lays out steps you can try from scratch. I suggest any engineer try them and play around on their own time.

PostgreSQL and BRIN indexes

To create a sample table with 1.7 billion rows of sample sensor data (temperature readings from a sensor, with timestamps in the logs):
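
Something along these lines will do it; the table and column names here (sensorlog, locid, tsin, temp, usage) are illustrative, and the row count assumes 20,000 sensors reporting once per second for a day (20,000 × 86,400 = 1,728,000,000 rows):

    EXPLAIN ANALYZE
    CREATE TABLE sensorlog AS
      SELECT
        x::int             AS id,     -- monotonically increasing row id
        (x % 20000)::int   AS locid,  -- fake sensor/location ids in the range 0-19999
        now()              AS tsin,   -- a static timestamp for now; varied further below
        97.5::numeric(5,2) AS temp,   -- a static temperature reading
        x::int             AS usage   -- filler usage column
      FROM generate_series(1, 1728000000) AS gs(x);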

So it took 22 minutes to create the table, largely because the table is a modest 97 GB. Next, we create the indexes:
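
For example, BRIN indexes on the two columns that are physically correlated with insertion order in the sketch above, id and tsin:

    CREATE INDEX ON sensorlog USING brin (id);
    CREATE INDEX ON sensorlog USING brin (tsin);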

It took a good long while to create the indexes too, though because they're BRIN they're only 2-3 MB and they fit easily in RAM. Reading 96 GB isn't instantaneous, but it's not a real problem for my laptop running your workload.

Now we query it.
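
A representative aggregation over the sketch table; the column and range are just an example, and EXPLAIN ANALYZE shows whether the BRIN index on id is used and how long the scan takes:

    EXPLAIN ANALYZE
    SELECT count(*), min(temp), max(temp)
    FROM sensorlog
    WHERE id BETWEEN 1000000 AND 1001000;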

Update with timestamps

Here we regenerate the table with distinct timestamps in order to satisfy the request to index and search on a timestamp column. Creation takes a bit longer because to_timestamp(int) is substantially slower than now() (which is cached for the transaction).
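
Continuing the sketch, one way to get distinct timestamps is to map each generated row to its own second with to_timestamp, so the data spans roughly 55 years of one-reading-per-second history:

    DROP TABLE IF EXISTS sensorlog;

    CREATE TABLE sensorlog AS
      SELECT
        x::int             AS id,
        (x % 20000)::int   AS locid,
        to_timestamp(x)    AS tsin,   -- one reading per second, starting at the Unix epoch
        97.5::numeric(5,2) AS temp,
        x::int             AS usage
      FROM generate_series(1, 1728000000) AS gs(x);

    CREATE INDEX ON sensorlog USING brin (tsin);
    CREATE INDEX ON sensorlog USING brin (id);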

Now we can run a query on a timestamp value instead:
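
For instance, aggregating a single day out of the hypothetical table (the exact day is arbitrary, any day inside the generated range works); at one reading per second, a full day with an inclusive upper bound is exactly the 86,401 records quoted below:

    EXPLAIN ANALYZE
    SELECT count(*), min(temp), max(temp)
    FROM sensorlog
    WHERE tsin BETWEEN '1974-01-01' AND '1974-01-02';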

The result: in 83.321 ms we can aggregate 86,401 records in a table with 1.7 billion rows. That should be reasonable.

Hour ending

Calculating the hour ending is pretty easy too: truncate the timestamp down and then simply add an hour.
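
In terms of the sketch schema, that amounts to something like:

    SELECT
      date_trunc('hour', tsin) + interval '1 hour' AS hour_ending,
      count(*),
      min(temp),
      max(temp)
    FROM sensorlog
    WHERE tsin BETWEEN '1974-01-01' AND '1974-01-02'
    GROUP BY 1
    ORDER BY 1;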

It’s important to note that it’s not using an index on the aggregation, though it could. If that’s your typical query, you probably want a BRIN on date_trunc('hour', tsin); therein lies a small problem, in that date_trunc is not immutable, so you’d have to first wrap it to make it so.
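
One possible wrapper, assuming the readings should be interpreted in UTC: date_trunc on a timestamptz is only STABLE because it depends on the session time zone, so pinning the zone makes the expression genuinely immutable and therefore usable in an index:

    CREATE FUNCTION date_trunc_hour_utc(timestamptz)
      RETURNS timestamptz
      LANGUAGE sql IMMUTABLE AS
    $$ SELECT date_trunc('hour', $1 AT TIME ZONE 'UTC') AT TIME ZONE 'UTC' $$;

    CREATE INDEX ON sensorlog USING brin (date_trunc_hour_utc(tsin));

Queries then need to group on date_trunc_hour_utc(tsin) rather than on date_trunc('hour', tsin) directly for the planner to consider this index.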

Partitioning

Another important point about PostgreSQL is that PG 10 brings partitioning DDL. So you can, for instance, easily create partitions for every year, breaking your modest table down into minor ones that are tiny. In doing so, you should be able to use and maintain btree indexes rather than BRIN, which would be even faster.
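
A sketch of a yearly range-partitioned layout with PG 10's declarative partitioning, again using the hypothetical sensorlog schema; note that in PG 10 the btree indexes have to be created on each partition individually:

    -- a fresh, partitioned layout (this would replace the plain table above)
    CREATE TABLE sensorlog (
      id    int,
      locid int,
      tsin  timestamptz,
      temp  numeric(5,2),
      usage int
    ) PARTITION BY RANGE (tsin);

    CREATE TABLE sensorlog_1974 PARTITION OF sensorlog
      FOR VALUES FROM ('1974-01-01') TO ('1975-01-01');
    CREATE INDEX ON sensorlog_1974 USING btree (tsin);

    CREATE TABLE sensorlog_1975 PARTITION OF sensorlog
      FOR VALUES FROM ('1975-01-01') TO ('1976-01-01');
    CREATE INDEX ON sensorlog_1975 USING btree (tsin);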

This is a great answer on the topic of working with analytics data in SQL databases. Finally, thinking about table partitioning is always a good plan-ahead strategy for any dataset that grows beyond millions of rows and is distributed over timestamps.

Reference: Best database and table design for billions of rows of data