Spec fill/Interpolate function #436

adamperlin · 2018-07-03T21:52:48Z

From ifql created by aanthony1243 : influxdata/ifql#255

Interpolate Proposed as a replacement for Fill()

The basic notion of fill() in InfluxQL is to fill in missing values under certain conditions, most often in the case of an empty time window or group. A more general notion is that of interpolate() which should solve the existing use cases, and also provide more advanced features. The end of this text includes some concrete examples of InfluxQL usage for the fill() function.

initial writeup of interpolate design, including a discussion of the Window() operation can be found here: https://docs.google.com/document/d/1C4ELMRJblvy7UsyBF_o3TGVfGrydChKET2LdGOYu53k/edit?usp=sharing

from the above doc:

Types of interpolation:

Interpolation may be defined on a NxM matrix as filling in the null/missing values on that matrix. In influxdb, a matrix can be materialized as a collection of related series where the row-indices are the series timestamp, and the column indices are series ID:

time/seriesID	m,f1=foo	m,f1=bar	m,f1=asd
T1	(t1, 23)	null	(t3,19)
T2	(t1,1)	(t2,5)	(t3,2)
T3	(t1,1)	null	null

We’ll initially define two types of interpolation, one that operates only on rows of the matrix, and one that operates only on the columns of the matrix.

Column-wise interpolation

A column-wise interpolation will focus on a single series, so that other series in the data matrix are not considered. A few types of interpolation on a column to consider are:

Fill with null
Fill with constant
Fill with previous column value from the series
Fill with an aggregate or selector from the entire series (min, max, mean, count, etc.)
Fill with a moving average from the series
Linear interpolation between two known values on the series

Row-wise interpolation

Row-wise interpolation isn’t necessarily different from column-wise interpolation. In principle, a row-wise interpolation can be achieved by applying a column-wise interpolation to the transposed data matrix. However, in InfluxDB, there are some limitations because while the rows of the data matrix in this context are numeric and sorted by time, the columns of the matrix are categorical and may appear in any arbitrary order. With this in mind, we consider the following row-wise interpolations valid:

Fill with null
Fill with constant value
Fill with aggregate or selector from other series, but in the same time row. (e.g. from(db) |> selectMeasurement(m) |> range(start:t2, end:t3) |> mean() )

If it were possible to assign a total ordering on the rows of the matrix, then the remaining interpolations may be well-defined:

Fill with ‘previous’ series value in the same time range row
Fill with a moving average of series values found in the same on the time range row.

Finally, linear interpolation would be difficult to define generally on a row. If some notion of numeric distance between row values may be defined on the tags, then some form of linear interpolation may be applied. Currently, the language has little support for this other than presuming a fixed unit of distance between two adjacent columns in the matrix.

Column-Wise Interpolation Operator

The interpolation function on a column can be simplified into two parts:
Identify each missing value in the series.
Determine a replacement for that missing value.

The most convenient scenario for interpolation will be where a timestamped NULL value exists on the series. Later, we will lift this assumption, but given that a series contains timestamped nulls, a possible interpolate function would be:

interpolate(table=<-, nullFn, fillFn)

Where nullFn is a function defined on a series value that returns true/false if a value should be considered a candidate for replacement, and fillFn is a function that generates a value to insert in place of the NULL value.

This approach would be the most general, but possibly inefficient unless the fillFn can be optimized such that it can be computed quickly. In general, interpolation is computed using known values that are near the NULL value. We can cache these values into a moving window as we scan the series, and build in the various functions that may be desirable for interpolation:
interpolate(table=<-, nullFn, stepsPrev, stepsNext, fillType=”prev”)

Where stepsPrev and stepsNext indicate the boundary of the window to each side of the NULL value, and the fill Type may be one of:

Constant value: set the value to some specified constant
Prev: set the value to the furthest previous value
Next: set the value to the farthest next value
Min/Max/Mean/Count/AggFn: aggregate all values found within the window.
Linear Interpolation: use point-slope computation to determine a value on the line between the furthest previous and farthest next value.
Linear regression: fit a line to all points in the window, and estimate a value for the NULL based on this line.
?? may be tricky to allow this to be customizable with a function, but worth discussing.

Appendix: InfluxQL examples, with Flux adaptations:

External requirements

Null values: we need to have some representation of null/missing values for a row so that we can identify where to apply interpolation.
Many existing InfluxQL queries use grouping/windowing to segment the data, and then a Fill() operation to insert a default value for any empty segments. To get the most out of this feature, we need to make sure that our Window() function outputs empty groups either by default or by argument:
|> window(every: 10m, keepEmpty: true)
NOTE: depending on how we implement Window() example 1 may not require interpolate() at all, if we did something like: |> window(every:10m, emptyValue: 0

Example 1:
Many chronograf queries require data for each point on an axis. If the axis is populated with the results of a GROUP BY, empty groups must get a default value. Exmaple:

// SELECT mean("queueBytes")
// FROM "telegraf"."default"."influxdb_hh_processor"
// WHERE "cluster_id" = :Cluster_Id:
//       AND time > :dashboardTime:
// GROUP BY :interval:, "host" fill(0)
HintedHandoffQueueSize = (INTERVAL, CID, DASHTIME) =>
  telegrafDashtime(DASHTIME: DASHTIME)
    |> withCID(value: CID)
    |> select(measurement: "influxdb_hh_processor", field: "queueBytes")
    |> window(every: INTERVAL, keepEmpty: true)
    |> interpolate(nullFn: (r) => r._field == NULL,  0, 0, fillType="constant", fillValue=0)
    |> group(by: ["host"])
    |> mean()

Example 2:
it's not well-defined what happens here, if anything. But there are some queries on the cloud monitor that use Fill(0) independently of a GROUP BY:

// InfluxQL Disk Usage Query
// SELECT last("used")/1073741824 AS "used" FROM
//   "telegraf"."default"."disk"
//   WHERE time > :dashboardTime:
//     AND cluster_id = :Cluster_Id:
//     AND (host =~ /.data./ OR host =~ /tot-.*-(3|4)/)
//   FILL(0)
DiskUsage = (DASHTIME) =>
    fromRange(forDB:"telegraf", forRange:DASHTIME)
      |> select(measurement: "disk", field: "used")
      //|> withCID(value: CID)
      //|> hostfilter()
      |> last()
      |> map(fn:(r) => r._value / 1073741824)

It's most possible that this query started out as a continuous query, where fill() is defined in some contexts in this manner. In the query above, it's not 100% clear what's meant to be filled. This use-case may not require any changes unless we support continuous-Flux queries.

The text was updated successfully, but these errors were encountered:

nikoladsp · 2021-01-28T09:01:40Z

Hi,
just a question - what is the status of this functionality? We switched to v2 recently and need some missing data interpolation. If this is not in your plans, what are our options?

Thank you in advance

Kindest regards

cifwig · 2021-04-29T15:57:27Z

I'm struggling with the measurement-series gaps and was looking for a solution.
After some research I have found this:
https://blog.timescale.com/blog/sql-functions-for-time-series-analysis/
May be, their solution gives you an idea for an approach.
See also: #6967

nathanielc · 2021-04-29T20:51:05Z

We have started with a simple function for linear interpolation see https://github.com/influxdata/flux/blob/master/stdlib/interpolate/interpolate.flux Public docs are incoming

empz · 2021-05-10T12:00:28Z

We're in the need of having backward fill interpolation. Is this planned?
We'd like to have it available on InfluxDB Cloud as soon as possible.

mdraevich · 2021-09-16T22:06:09Z

linear interpolation is not enough. The status of the feature should be updated

github-actions · 2024-11-25T01:58:15Z

This issue has had no recent activity and will be closed soon.

nathanielc transferred this issue from another repository Dec 10, 2018

nathanielc mentioned this issue Dec 20, 2018

Alpha - Flux influxdata/docs-v2#2

Closed

nathanielc changed the title ~~need fill/Interpolate function~~ Spec fill/Interpolate function Jan 7, 2019

nathanielc added the difficulty/low label Jan 7, 2019

nathanielc added the team/query label Jul 1, 2019

russorat added enhancement New feature or request new function func/fill labels Mar 4, 2020

github-actions bot added the no-issue-activity label Nov 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spec fill/Interpolate function #436

Spec fill/Interpolate function #436

adamperlin commented Jul 3, 2018 •

edited by timhallinflux

Loading

nikoladsp commented Jan 28, 2021 •

edited

Loading

cifwig commented Apr 29, 2021

nathanielc commented Apr 29, 2021

empz commented May 10, 2021

mdraevich commented Sep 16, 2021

github-actions bot commented Nov 25, 2024

Spec fill/Interpolate function #436

Spec fill/Interpolate function #436

Comments

adamperlin commented Jul 3, 2018 • edited by timhallinflux Loading

Interpolate Proposed as a replacement for Fill()

Types of interpolation:

Column-wise interpolation

Row-wise interpolation

Column-Wise Interpolation Operator

Appendix: InfluxQL examples, with Flux adaptations:

External requirements

nikoladsp commented Jan 28, 2021 • edited Loading

cifwig commented Apr 29, 2021

nathanielc commented Apr 29, 2021

empz commented May 10, 2021

mdraevich commented Sep 16, 2021

github-actions bot commented Nov 25, 2024

adamperlin commented Jul 3, 2018 •

edited by timhallinflux

Loading

nikoladsp commented Jan 28, 2021 •

edited

Loading