Downsampling with existing data with InfluxDB

2023-05-03
2023-05-03
Header: influxdb-downsample

Setting up continuous query and retention policy for downsampling with existing data in InfluxDB 1.x.

This series contains the following posts:

I'm using InfluxDB 1.8 because starting from version 2.0 there is no ARM 32bit builds. Also, version 1.x is much simpler than 2.0.

However, things should be similar with InfluxDB 2.0 tasks.

The InfluxDB documentation includes a great guide for downsampling and retaining data. However, if you are not starting from an empty database, the guide is not enough.

I have all the data from the sensors in the database sensors. The data looks like this:

SELECT * FROM eco2 WHERE time > now() - 10s;
sql
name: eco2
time                sensor value
----                ------ -----
1683080134061871104 SGP30  836
1683080135183121152 SGP30  833
1683080136307070720 SGP30  830
1683080137418324480 SGP30  820
1683080138542053888 SGP30  826
1683080139652603392 SGP30  833
1683080140753946368 SGP30  831
1683080141879817472 SGP30  828
1683080143005833728 SGP30  835

And I want to downsample the data into another database called sensors-history:

CREATE DATABASE sensors_history;
sql

You can just create another retention policy instead of creating a new database, and you just need to change the queries below from "sensors-history"."autogen" to "sensors"."oneyear" where oneyear is the name of the retention policy you just created.

Downsample all existing data#

Before creating a retention policy on sensors, we first need to downsample all existing data into sensors-history:

USE sensors;

SELECT MEAN(*) INTO "sensors_history"."autogen".:MEASUREMENT
FROM /.*/ GROUP BY time(10m),*;
sql

I added newlines in the query for readability. If you are executing the queries from the influx REPL, you need to remove the newlines for each query.

The wildcard and the backreferencing syntax :MEASUREMENT can be found in the InfluxQL documentation. All it does is to select all data from all measurements (the regex /.*/ part), downsample them with the interval of 10m, and write to the corresponding :MEASUREMENTs in sensors-history database with the autogen retention policy, preserving all the tags (GROUP BY * part).

If you want more details, you can do this:

SELECT
  MEAN("value") AS mean_value,
  PERCENTILE("value",95) AS high_value,
  PERCENTILE("value",5) AS low_value
INTO "sensors_history"."autogen".:MEASUREMENT
FROM /.*/
GROUP BY time(10m),*;
sql

where I used the PERCENTILE function instead of MAX and MIN for a more robust aggregation. As a consequence, I have to use AS for aliasing, as the default query naming fails to distinguish the two percentile functions.

Setting up a retention policy#

After checking that all data is downsampled into sensors-history, it's time to set up the retention policy and delete old data. Here I will just keep the data for one day:

CREATE RETENTION POLICY "oneday" ON "sensors" DURATION 1d REPLICATION 1 DEFAULT;
sql

where REPLICATION 1 is the default for single-node InfluxDB deployment.

Since I've set the new policy oneday as the default, the query will return an empty result if the retention policy is not specified in the query. In other words, it may look like you've lost all your data after setting a retention policy.

To make the query return recent 24h data, I need to move the recent data from old autogen to oneday:

SELECT * INTO "oneday".:MEASUREMENT
FROM "autogen"./.*/
WHERE time > now() - 1d GROUP BY *;
sql

After checking everything is working fine, it's time to remove old data.

DELETE FROM /.*/ WHERE time < now() - 1d;
sql

Creating the continuous query#

The last step is to create the continuous query:

CREATE CONTINUOUS QUERY "cq_tidy_sensors" ON "sensors"
BEGIN
  SELECT
    MEAN("value") AS mean_value,
    PERCENTILE("value",95) AS high_value,
    PERCENTILE("value",5) AS low_value
  INTO "sensors_history"."autogen".:MEASUREMENT
  FROM "sensors"."oneday"./.*/
  GROUP BY time(10m),*;
END
sql

The SELECT query is similar to the query above, but it's better to specify the retention policy oneday, especially if you didn't make it the default.

The continuous query will automatically execute every 10m, selecting recent 10m of data from the sensors database to sensors-history. You can check your continuous query by running

SHOW CONTINUOUS QUERIES;
sql

If everything goes fine, you will see something like this every 10m in your InfluxDB log (docker compose logs if using Docker Compose):

msg="Continuous query execution (start)" service=continuous_querier op_name=continuous_querier_execute op_event=start
msg="Executing continuous query" service=continuous_querier op_name=continuous_querier_execute name=cq_tidy_sensors db_instance=sensors start=2023-05-03T03:00:00.000000Z end=2023-05-03T03:10:00.000000Z
msg="Executing query" service=query query="SELECT mean(value) AS mean_value, percentile(value, 95) AS high_value, percentile(value, 5) AS low_value INTO sensors_history.autogen.:MEASUREMENT FROM sensors.oneday./.*/ WHERE time >= '2023-05-03T03:00:00Z' AND time < '2023-05-03T03:10:00Z' GROUP BY time(10m), *"
msg="Finished continuous query" service=continuous_querier op_name=continuous_querier_execute name=cq_tidy_sensors db_instance=sensors written=7 start=2023-05-03T03:00:00.000000Z end=2023-05-03T03:10:00.000000Z duration=1069ms
msg="Continuous query execution (end)"

🥳 We have set up downsampling with existing data!

Leave your comments and reactions on GitHub