
Average Value Query doesn't make sense #19952

Open
JustinPrivitera opened this issue Oct 25, 2024 · 10 comments
Labels
bug (Something isn't working) · likelihood medium (Neither low nor high likelihood) · impact medium (Productivity partially degraded (not easily mitigated bug) or improved (enhancement))

Comments

@JustinPrivitera
Member

VisIt 3.4.1

I am looking at curv2d.silo variables d and u. d is zonal and u is nodal.

I have run Variable Sum queries for both, a NumZones query, a NumNodes query, and Average Value queries for both. Below are the results:

[screenshot: Variable Sum, NumZones/NumNodes, and Average Value query results for d and u]

If I were computing the average for d, I would take the total d = 3453.26 and divide it by the number of zones = 988. This yields 3.495, which differs from the Average Value query's result of 3.861.

If I were computing the average for u, I would take the total u = 67.7007 and divide it by the number of nodes = 1053. This yields 0.0643, which differs from the Average Value query's result of 0.0222.
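For reference, the by-hand arithmetic can be reproduced directly from the reported sums and counts (a quick sanity check using only the numbers quoted in this comment):

```python
# Sanity check of the by-hand averages, using the sums and counts
# reported by the Variable Sum, NumZones, and NumNodes queries above.
d_sum, num_zones = 3453.26, 988
u_sum, num_nodes = 67.7007, 1053

d_avg = d_sum / num_zones   # ~3.495, vs. 3.861 from the Average Value query
u_avg = u_sum / num_nodes   # ~0.0643, vs. 0.0222 from the Average Value query
print(d_avg, u_avg)
```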

This isn't unique to curv2d.silo, which has ghosts. It happens in curv3d.silo as well:
[screenshot: the same queries run on curv3d.silo]

You can verify that the averages are again wrong.

Is the Average Value query doing something with weights? How do I just get the mean value? We should either correct what looks like faulty behavior or make it clear to users what the Average Value query is actually doing.
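For illustration, here is a toy sketch (made-up numbers, not VisIt code) of how a zone-volume-weighted mean can differ from the plain arithmetic mean of the same values, which would explain the discrepancy if the query is weighting by zone volume/area:

```python
# Sketch: a volume-weighted mean vs. the plain arithmetic mean
# of the same zonal values. Toy numbers, not taken from curv2d.silo.
values  = [1.0, 2.0, 10.0]   # zonal variable values
volumes = [5.0, 3.0, 1.0]    # zone volumes used as weights

unweighted = sum(values) / len(values)
weighted = sum(v * w for v, w in zip(values, volumes)) / sum(volumes)

print(unweighted)  # 13/3  ~= 4.333
print(weighted)    # 21/9  ~= 2.333
```

If the Average Value query is really avtWeightedVariableSummationQuery underneath, the second formula is the kind of result it would produce.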

@JustinPrivitera added the bug, likelihood medium, and impact medium labels on Oct 25, 2024
@biagas
Contributor

biagas commented Oct 25, 2024

The AverageValue query is a subclass of avtWeightedVariableSummationQuery; you would have to look at the parent class's source to determine which weighting is being applied. I don't recall offhand.

@JustinPrivitera
Member Author

I saw that in the source code, but I didn't go down the rabbit hole of trying to understand how the weighting is applied.

I think this is opaque to users: the Average Value query should either work as you'd expect, or we should provide controls for the weighting.

@JustinPrivitera
Member Author

The Sample Statistics Query similarly gives strange answers.

[screenshot: Sample Statistics query results for d and u]

The first set of values is for d and the second is for u. The mean for d matches the average we computed above, but the mean for u is way off. I believe both standard deviations are incorrect as well.

@JustinPrivitera
Member Author

When we figure this out, we should double-check the values against those computed in the tests for #19955.

@JustinPrivitera
Member Author

If the statistics produced by our query system turn out to be right, then we will need to modify the global mesh expressions #19955 in order to calculate these statistics correctly.

@JustinPrivitera
Member Author

There is another statistics query that gives you the mean; it is also different.

@markcmiller86
Member

markcmiller86 commented Nov 9, 2024

I've probably mentioned this before in another context, but it's based on the observation that the average is just one of perhaps several statistical measures worth computing in a single pass over the dataset. Here are some of those possible measures:

  • Mean (Average): The sum of all the numbers divided by the count of the numbers. It provides a measure of central tendency. Also known as the First moment.
  • Median: The middle value when the numbers are arranged in ascending order. If the count of numbers is even, the median is the average of the two middle numbers. This measure is less affected by outliers and skewed data.
  • Mode: The number that appears most frequently in the dataset. There can be multiple modes (bimodal, trimodal, etc.) or no mode if all values are unique.
  • Range: The difference between the maximum and minimum values in the dataset. It gives a basic idea of the spread.
  • Min/Max: The minimum and maximum values. Also known as extrema.
  • Variance: A measure of how spread out the numbers are around the mean. It is the average of the squared differences from the Mean. Also known as the Second moment.
  • Standard Deviation: The square root of the variance, providing a measure of the spread of the numbers around the mean in the same units as the data.
  • Interquartile Range (IQR): The range between the 25th percentile (first quartile) and the 75th percentile (third quartile). It is a robust measure of variability that is resistant to outliers.
  • Skewness: A measure of the asymmetry of the data around the sample mean. If the skewness is close to 0, the data are fairly symmetrical.
  • Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. It can indicate the presence of outliers.
  • Percentiles: Values below which a certain percent of the data fall. For example, the 50th percentile is the median.

I think it would be best if we had both a query and an expression called Statistics or Statistical measures which did all of this in a single pass. None of these values involves much computation, so there is little reason not to compute all of them whenever we compute any one.
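As a sketch of the single-pass idea (illustrative Python, not VisIt code), the moment-based measures above can all be accumulated in one loop using Welford-style moment updates:

```python
import math

def single_pass_stats(data):
    """Accumulate min/max and the first four central moments in one pass
    (Welford/Pebay-style updates), then derive the summary statistics.
    A sketch of the 'compute everything at once' idea; assumes data is non-empty."""
    n, mean, m2, m3, m4 = 0, 0.0, 0.0, 0.0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in data:
        n += 1
        delta = x - mean
        dn = delta / n
        term = delta * dn * (n - 1)
        mean += dn
        m4 += term * dn * dn * (n * n - 3 * n + 3) + 6 * dn * dn * m2 - 4 * dn * m3
        m3 += term * dn * (n - 2) - 3 * dn * m2
        m2 += term
        lo, hi = min(lo, x), max(hi, x)
    variance = m2 / n  # population variance
    return {
        "mean": mean, "min": lo, "max": hi, "range": hi - lo,
        "variance": variance, "stddev": math.sqrt(variance),
        "skewness": math.sqrt(n) * m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": n * m4 / (m2 * m2) - 3.0 if m2 else 0.0,
    }

print(single_pass_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Median, mode, percentiles, and the IQR are the exceptions: they need sorted or counted data and so don't fit a strict single pass, though approximate streaming variants exist.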

In the Query GUI for Statistics or Statistical measures, we could have check boxes controlling which values actually get displayed in the response, similar to the check boxes for a node/zone pick. Or, better yet, we just return them all in the query response and let the user decide which value(s) to pay attention to. Either way, all values would be computed.

In expressions (which I think we discussed recently), an expression such as statistics('name1'[,'name2',...]) would still compute all of these metrics but would return, as a vector value, only the statistics named in its arguments. For example, statistics('mean','variance','min/max')[1] would be the variance, whereas statistics() (no args) would return the default, which is everything; in that case an index of 2, for example, would index into the global list of statistical measures above, i.e., the mode.
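A hypothetical sketch of that indexing scheme (the measure names, their ordering, and the statistics_result helper are all illustrative, not an existing VisIt API):

```python
# Proposed semantics: everything is computed, but only the requested
# measures are returned, in the order requested; with no arguments,
# indices follow the global ordering of measures.
ALL_MEASURES = ["mean", "median", "mode", "range", "min/max",
                "variance", "stddev", "iqr", "skewness", "kurtosis",
                "percentiles"]

def statistics_result(computed, *requested):
    """computed: dict mapping every measure name to its value;
    requested: subset of names to return (empty means all)."""
    names = requested if requested else ALL_MEASURES
    return [computed[name] for name in names]

# Dummy values standing in for the actual single-pass computation.
computed = {name: float(i) for i, name in enumerate(ALL_MEASURES)}

# statistics('mean','variance','min/max')[1] -> the variance
print(statistics_result(computed, "mean", "variance", "min/max")[1])
# statistics()[2] -> index into the global list, i.e. the mode
print(statistics_result(computed)[2])
```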

@JustinPrivitera
Member Author

I like this idea.

The other query I was thinking of was population statistics.

At the very least, it should be transparent how each query calculates its statistical values: is it using a weighted mean or an unweighted mean?

@markcmiller86
Member

Why not compute both the weighted and unweighted versions and produce both results?

Regarding the wording used in the query results: I see words like total, actual, and average being used without complete clarity about what they mean. What does actual mean? I think it means all non-ghost zones. Does total also mean non-ghost, or everything including ghosts? We could firm up the terminology a bit here.

@markcmiller86
Member

Somewhat related, #5135
