
Average Value Query doesn't make sense #19952

Open
JustinPrivitera opened this issue Oct 25, 2024 · 10 comments
Labels
bug (Something isn't working) · likelihood medium (Neither low nor high likelihood) · impact medium (Productivity partially degraded (not easily mitigated bug) or improved (enhancement))

Comments

@JustinPrivitera
Member

VisIt 3.4.1

I am looking at curv2d.silo variables d and u. d is zonal and u is nodal.

I have run Variable Sum queries for both, a NumZones query, a NumNodes query, and Average Value queries for both. Below are the results:

[screenshot: Variable Sum, NumZones/NumNodes, and Average Value query results for d and u]

If I were computing the average for d, I would take the total d = 3453.26 and divide it by the number of zones = 988. This yields 3.495, which differs from the Average Value query's result of 3.861.

If I were computing the average for u, I would take the total u = 67.7007 and divide it by the number of nodes = 1053. This yields 0.0643, which differs from the Average Value query's result of 0.0222.
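For reference, the by-hand arithmetic can be reproduced directly from the reported sums and counts (a quick sanity check using only the numbers quoted in this comment):

```python
# Sanity check of the by-hand averages, using the sums and counts
# reported by the Variable Sum, NumZones, and NumNodes queries above.
d_sum, num_zones = 3453.26, 988
u_sum, num_nodes = 67.7007, 1053

d_avg = d_sum / num_zones   # ~3.495, vs. 3.861 from the Average Value query
u_avg = u_sum / num_nodes   # ~0.0643, vs. 0.0222 from the Average Value query
print(d_avg, u_avg)
```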

This isn't unique to curv2d.silo, which has ghosts. It happens in curv3d.silo as well:
[screenshot: the same queries run on curv3d.silo]

You can verify that the averages are again wrong.

Is the Average Value query doing something with weights? How do I just get the mean value? We should either correct what looks like faulty behavior or make it clear to users what the Average Value query is actually doing.
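For illustration, here is a toy sketch (made-up numbers, not VisIt code) of how a zone-volume-weighted mean can differ from the plain arithmetic mean of the same values, which would explain the discrepancy if the query is weighting by zone volume/area:

```python
# Sketch: a volume-weighted mean vs. the plain arithmetic mean
# of the same zonal values. Toy numbers, not taken from curv2d.silo.
values  = [1.0, 2.0, 10.0]   # zonal variable values
volumes = [5.0, 3.0, 1.0]    # zone volumes used as weights

unweighted = sum(values) / len(values)
weighted = sum(v * w for v, w in zip(values, volumes)) / sum(volumes)

print(unweighted)  # 13/3  ~= 4.333
print(weighted)    # 21/9  ~= 2.333
```

If the Average Value query is really avtWeightedVariableSummationQuery underneath, the second formula is the kind of result it would produce.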

@JustinPrivitera added the bug, likelihood medium, and impact medium labels on Oct 25, 2024
@biagas
Contributor

biagas commented Oct 25, 2024

The AverageValue query is a subclass of avtWeightedVariableSummationQuery; you would have to look at the parent class's source to determine which weighting is being applied. I don't recall offhand.

@JustinPrivitera
Member Author

I saw that in the source code, but I didn't go down the rabbit hole of trying to understand how the weighting is applied.

I think this is opaque to users: the Average Value query should either work as you'd expect, or we should provide controls for the weighting.

@JustinPrivitera
Member Author

The Sample Statistics Query similarly gives strange answers.

[screenshot: Sample Statistics query results for d and u]

The first set of values is for d and the second is for u. The mean for d matches the average we computed above, but the mean for u is way off. I believe both standard deviations are incorrect as well.

@JustinPrivitera
Member Author

When we figure this out, we should double-check the values against those computed in the tests for #19955.

@JustinPrivitera
Member Author

If the statistics produced by our query system turn out to be right, then we will need to modify the global mesh expressions #19955 in order to calculate these statistics correctly.

@JustinPrivitera
Member Author

There is another statistics query that gives you the mean; it is also different.

@markcmiller86
Member

markcmiller86 commented Nov 9, 2024

I've probably mentioned this before in another context, but it's based on the observation that the average is just one of perhaps several statistical measures worth computing in a single pass over the dataset. Here are some of those possible measures:

  • Mean (Average): The sum of all the numbers divided by the count of the numbers. It provides a measure of central tendency. Also known as the First moment.
  • Median: The middle value when the numbers are arranged in ascending order. If the count of numbers is even, the median is the average of the two middle numbers. This measure is less affected by outliers and skewed data.
  • Mode: The number that appears most frequently in the dataset. There can be multiple modes (bimodal, trimodal, etc.) or no mode if all values are unique.
  • Range: The difference between the maximum and minimum values in the dataset. It gives a basic idea of the spread.
  • Min/Max: The minimum and maximum values. Also known as extrema.
  • Variance: A measure of how spread out the numbers are around the mean. It is the average of the squared differences from the Mean. Also known as the Second moment.
  • Standard Deviation: The square root of the variance, providing a measure of the spread of the numbers around the mean in the same units as the data.
  • Interquartile Range (IQR): The range between the 25th percentile (first quartile) and the 75th percentile (third quartile). It is a robust measure of variability that is resistant to outliers.
  • Skewness: A measure of the asymmetry of the data around the sample mean. If the skewness is close to 0, the data are fairly symmetrical.
  • Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. It can indicate the presence of outliers.
  • Percentiles: Values below which a certain percent of the data fall. For example, the 50th percentile is the median.

I think it would be best if we had both a query and an expression called Statistics or Statistical measures which did all of this in a single pass. None of these values involves much computation, so there is little reason not to compute all of them whenever we compute any one.
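As a sketch of the single-pass idea (illustrative Python, not VisIt code), the moment-based measures above can all be accumulated in one loop using Welford-style moment updates:

```python
import math

def single_pass_stats(data):
    """Accumulate min/max and the first four central moments in one pass
    (Welford/Pebay-style updates), then derive the summary statistics.
    A sketch of the 'compute everything at once' idea; assumes data is non-empty."""
    n, mean, m2, m3, m4 = 0, 0.0, 0.0, 0.0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in data:
        n += 1
        delta = x - mean
        dn = delta / n
        term = delta * dn * (n - 1)
        mean += dn
        m4 += term * dn * dn * (n * n - 3 * n + 3) + 6 * dn * dn * m2 - 4 * dn * m3
        m3 += term * dn * (n - 2) - 3 * dn * m2
        m2 += term
        lo, hi = min(lo, x), max(hi, x)
    variance = m2 / n  # population variance
    return {
        "mean": mean, "min": lo, "max": hi, "range": hi - lo,
        "variance": variance, "stddev": math.sqrt(variance),
        "skewness": math.sqrt(n) * m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": n * m4 / (m2 * m2) - 3.0 if m2 else 0.0,
    }

print(single_pass_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Median, mode, percentiles, and the IQR are the exceptions: they need sorted or counted data and so don't fit a strict single pass, though approximate streaming variants exist.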

In the Query GUI for Statistics or Statistical measures, we could have check boxes controlling which values actually get displayed in the response, similar to the check boxes for a node/zone pick. Or, better yet, we just return them all in the query response and let the user decide which value(s) to pay attention to. Either way, all values would be computed.

In expressions (which I think we discussed recently), an expression such as statistics('name1'[,'name2',...]) would still compute all of these metrics but would return, as a vector value, only the statistics named in its arguments. For example, statistics('mean','variance','min/max')[1] would be the variance, whereas statistics() (no args) would return the default, which is everything; in that case an index of 2, for example, would index into the global list of statistical measures above, i.e., the mode.
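A hypothetical sketch of that indexing scheme (the measure names, their ordering, and the statistics_result helper are all illustrative, not an existing VisIt API):

```python
# Proposed semantics: everything is computed, but only the requested
# measures are returned, in the order requested; with no arguments,
# indices follow the global ordering of measures.
ALL_MEASURES = ["mean", "median", "mode", "range", "min/max",
                "variance", "stddev", "iqr", "skewness", "kurtosis",
                "percentiles"]

def statistics_result(computed, *requested):
    """computed: dict mapping every measure name to its value;
    requested: subset of names to return (empty means all)."""
    names = requested if requested else ALL_MEASURES
    return [computed[name] for name in names]

# Dummy values standing in for the actual single-pass computation.
computed = {name: float(i) for i, name in enumerate(ALL_MEASURES)}

# statistics('mean','variance','min/max')[1] -> the variance
print(statistics_result(computed, "mean", "variance", "min/max")[1])
# statistics()[2] -> index into the global list, i.e. the mode
print(statistics_result(computed)[2])
```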

@JustinPrivitera
Member Author

I like this idea.

The other query I was thinking of was population statistics.

At the very least, it should be transparent how each query calculates its statistical values: is it using a weighted mean or an unweighted mean?

@markcmiller86
Member

Why not compute both the weighted and unweighted versions and produce both results?

Regarding the wording used in the query results: I see words like total, actual, and average being used without complete clarity about what they mean. What does actual mean? I think it means all non-ghost zones. Does total also mean non-ghost, or everything including ghosts? We could firm up the terminology a bit here.

@markcmiller86
Member

Somewhat related, #5135
