Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec fill/Interpolate function #436

Open
adamperlin opened this issue Jul 3, 2018 · 6 comments
Open

Spec fill/Interpolate function #436

adamperlin opened this issue Jul 3, 2018 · 6 comments

Comments

@adamperlin
Copy link
Contributor

adamperlin commented Jul 3, 2018

From ifql created by aanthony1243 : influxdata/ifql#255

Interpolate Proposed as a replacement for Fill()

The basic notion of fill() in InfluxQL is to fill in missing values under certain conditions, most often in the case of an empty time window or group. A more general notion is that of interpolate() which should solve the existing use cases, and also provide more advanced features. The end of this text includes some concrete examples of InfluxQL usage for the fill() function.

initial writeup of interpolate design, including a discussion of the Window() operation can be found here: https://docs.google.com/document/d/1C4ELMRJblvy7UsyBF_o3TGVfGrydChKET2LdGOYu53k/edit?usp=sharing

from the above doc:

Types of interpolation:

Interpolation may be defined on a NxM matrix as filling in the null/missing values on that matrix. In influxdb, a matrix can be materialized as a collection of related series where the row-indices are the series timestamp, and the column indices are series ID:

time/seriesID m,f1=foo m,f1=bar m,f1=asd
T1 (t1, 23) null (t3,19)
T2 (t1,1) (t2,5) (t3,2)
T3 (t1,1) null null

We’ll initially define two types of interpolation, one that operates only on rows of the matrix, and one that operates only on the columns of the matrix.

Column-wise interpolation

A column-wise interpolation will focus on a single series, so that other series in the data matrix are not considered. A few types of interpolation on a column to consider are:

  1. Fill with null
  2. Fill with constant
  3. Fill with previous column value from the series
  4. Fill with an aggregate or selector from the entire series (min, max, mean, count, etc.)
  5. Fill with a moving average from the series
  6. Linear interpolation between two known values on the series

Row-wise interpolation

Row-wise interpolation isn’t necessarily different from column-wise interpolation. In principle, a row-wise interpolation can be achieved by applying a column-wise interpolation to the transposed data matrix. However, in InfluxDB, there are some limitations because while the rows of the data matrix in this context are numeric and sorted by time, the columns of the matrix are categorical and may appear in any arbitrary order. With this in mind, we consider the following row-wise interpolations valid:

  1. Fill with null
  2. Fill with constant value
  3. Fill with aggregate or selector from other series, but in the same time row. (e.g. from(db) |> selectMeasurement(m) |> range(start:t2, end:t3) |> mean() )

If it were possible to assign a total ordering on the rows of the matrix, then the remaining interpolations may be well-defined:

  • Fill with ‘previous’ series value in the same time range row
  • Fill with a moving average of series values found in the same on the time range row.

Finally, linear interpolation would be difficult to define generally on a row. If some notion of numeric distance between row values may be defined on the tags, then some form of linear interpolation may be applied. Currently, the language has little support for this other than presuming a fixed unit of distance between two adjacent columns in the matrix.

Column-Wise Interpolation Operator

The interpolation function on a column can be simplified into two parts:
Identify each missing value in the series.
Determine a replacement for that missing value.

The most convenient scenario for interpolation will be where a timestamped NULL value exists on the series. Later, we will lift this assumption, but given that a series contains timestamped nulls, a possible interpolate function would be:

interpolate(table=<-, nullFn, fillFn)

Where nullFn is a function defined on a series value that returns true/false if a value should be considered a candidate for replacement, and fillFn is a function that generates a value to insert in place of the NULL value.

This approach would be the most general, but possibly inefficient unless the fillFn can be optimized such that it can be computed quickly. In general, interpolation is computed using known values that are near the NULL value. We can cache these values into a moving window as we scan the series, and build in the various functions that may be desirable for interpolation:
interpolate(table=<-, nullFn, stepsPrev, stepsNext, fillType=”prev”)

Where stepsPrev and stepsNext indicate the boundary of the window to each side of the NULL value, and the fill Type may be one of:

  • Constant value: set the value to some specified constant
  • Prev: set the value to the furthest previous value
  • Next: set the value to the farthest next value
  • Min/Max/Mean/Count/AggFn: aggregate all values found within the window.
  • Linear Interpolation: use point-slope computation to determine a value on the line between the furthest previous and farthest next value.
  • Linear regression: fit a line to all points in the window, and estimate a value for the NULL based on this line.
  • ?? may be tricky to allow this to be customizable with a function, but worth discussing.

Appendix: InfluxQL examples, with Flux adaptations:

External requirements

  1. Null values: we need to have some representation of null/missing values for a row so that we can identify where to apply interpolation.

  2. Many existing InfluxQL queries use grouping/windowing to segment the data, and then a Fill() operation to insert a default value for any empty segments. To get the most out of this feature, we need to make sure that our Window() function outputs empty groups either by default or by argument:
    |> window(every: 10m, keepEmpty: true)

  3. NOTE: depending on how we implement Window() example 1 may not require interpolate() at all, if we did something like: |> window(every:10m, emptyValue: 0

Example 1:
Many chronograf queries require data for each point on an axis. If the axis is populated with the results of a GROUP BY, empty groups must get a default value. Exmaple:

// SELECT mean("queueBytes")
// FROM "telegraf"."default"."influxdb_hh_processor"
// WHERE "cluster_id" = :Cluster_Id:
//       AND time > :dashboardTime:
// GROUP BY :interval:, "host" fill(0)
HintedHandoffQueueSize = (INTERVAL, CID, DASHTIME) =>
  telegrafDashtime(DASHTIME: DASHTIME)
    |> withCID(value: CID)
    |> select(measurement: "influxdb_hh_processor", field: "queueBytes")
    |> window(every: INTERVAL, keepEmpty: true)
    |> interpolate(nullFn: (r) => r._field == NULL,  0, 0, fillType="constant", fillValue=0)
    |> group(by: ["host"])
    |> mean()

Example 2:
it's not well-defined what happens here, if anything. But there are some queries on the cloud monitor that use Fill(0) independently of a GROUP BY:

// InfluxQL Disk Usage Query
// SELECT last("used")/1073741824 AS "used" FROM
//   "telegraf"."default"."disk"
//   WHERE time > :dashboardTime:
//     AND cluster_id = :Cluster_Id:
//     AND (host =~ /.data./ OR host =~ /tot-.*-(3|4)/)
//   FILL(0)
DiskUsage = (DASHTIME) =>
    fromRange(forDB:"telegraf", forRange:DASHTIME)
      |> select(measurement: "disk", field: "used")
      //|> withCID(value: CID)
      //|> hostfilter()
      |> last()
      |> map(fn:(r) => r._value / 1073741824)

It's most possible that this query started out as a continuous query, where fill() is defined in some contexts in this manner. In the query above, it's not 100% clear what's meant to be filled. This use-case may not require any changes unless we support continuous-Flux queries.

@nathanielc nathanielc transferred this issue from another repository Dec 10, 2018
@nathanielc nathanielc changed the title need fill/Interpolate function Spec fill/Interpolate function Jan 7, 2019
@nikoladsp
Copy link

nikoladsp commented Jan 28, 2021

Hi,
just a question - what is the status of this functionality? We switched to v2 recently and need some missing data interpolation. If this is not in your plans, what are our options?

Thank you in advance

Kindest regards

@cifwig
Copy link

cifwig commented Apr 29, 2021

I'm struggling with the measurement-series gaps and was looking for a solution.
After some research I have found this:
https://blog.timescale.com/blog/sql-functions-for-time-series-analysis/
May be, their solution gives you an idea for an approach.
See also: #6967

@nathanielc
Copy link
Contributor

We have started with a simple function for linear interpolation see https://github.com/influxdata/flux/blob/master/stdlib/interpolate/interpolate.flux Public docs are incoming

@empz
Copy link

empz commented May 10, 2021

We're in the need of having backward fill interpolation. Is this planned?
We'd like to have it available on InfluxDB Cloud as soon as possible.

@mdraevich
Copy link

linear interpolation is not enough. The status of the feature should be updated

Copy link

This issue has had no recent activity and will be closed soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants