Make set of result XML files always consistent and provide parameter for selection #983

dbeyer · 2023-12-15T09:52:18Z

Given a benchmark definition with one run definition A with two task sets x and y and one run-definition B with one task-set z, then BenchExec produces ---surprizingly--- four XML result files:

for run definition A
for task set x
for task set y
for task set z

(1) just duplicates the information in (2) and (3) and can/should be omitted.

In the other direction, the pendant of (1) for (4) is missing, there is no duplicating XML results file for run definition B.

This causes (repeatedly) confusion on the users side, because implicitly and automatically one assumes a certain consistency.

PhilippWendler · 2023-12-15T10:33:26Z

Given a benchmark definition with one run definition A with two task sets x and y and one run-definition B with one task-set z, then BenchExec produces ---surprizingly--- four XML result files:
1. for run definition A

2. for task set x

3. for task set y

4. for task set z
(1) just duplicates the information in (2) and (3) and can/should be omitted.

Having (1) is on purpose and really important for lots of users. Without (1) one would always have to use a table-definition file with a <union> tag to create tables, (1) allows to just call table-generator with (1) as parameter.

In the other direction, the pendant of (1) for (4) is missing, there is no duplicating XML results file for run definition B.

Yes, because the pendant of (1) for (4) would be exactly the same file as (4), just with a different name. So this would indeed be fully redundant.

This causes (repeatedly) confusion on the users side, because implicitly and automatically one assumes a certain consistency.

I am aware of the inconsistency, but do not see a way to resolve it without breaking the standard way of using table-generator for many users, or writing out even more duplicate data to 100% redundant files.

dbeyer · 2023-12-15T12:29:30Z

Consistency has a price. The current situation is "cluttered".

It is a fundamental design principle to avoid data duplication whenever possible. (And we are generating duplicated data even without being asked for it.)

There are many possible solutions, let me name a few:

Users who want XML files for run definitions could simply concatenate the XML files for (x) and (y).
Have table-generator support that the user gives a bunch of result files, and it creates the union automatically, if this is a common use case (with parameter --union).
There could be command-line arguments to specify the level of result files, e.g., --results-for-rundefinitions and --results-for-tasksets.

If the most common use case is to use XMLs for run definitions, then we could simply suppress the files for task sets, that is, have --results-for-rundefinitions as the default, and provide --results-for-tasksets for users who want XML result files for task sets.

This would avoid duplication and support both use cases.
I doubt it is common that users want both.

To make the change non-breaking, we could go until 4.0 with the current behavior and print a deprecated warning, and support the two options from an arbitrary time onwards.

PhilippWendler · 2023-12-15T13:06:49Z

Consistency has a price. The current situation is "cluttered".

Yes, but I do not see that it is worth the price in this instance. Avoiding "clutter" is a relatively small advantage.

It is a fundamental design principle to avoid data duplication whenever possible. (And we are generating duplicated data even without being asked for it.)

I would say that this principle is useful when for example duplication can cause problems such as inconsistencies etc. But I do not see this here, and I don't think it is a more general principle to avoid this whenever possible. After all, we also produce duplicate data by for example writing out the results in different file formats (.xml, .txt) intended for different use cases. And the current duplication of run results in different kinds of XML files is similarly intended for different use cases.

There are many possible solutions, let me name a few:

Users who want XML files for run definitions could simply concatenate the XML files for (x) and (y).

As I mentioned, using the results for the run definitions is a common case (I am pretty sure it is the most common use case for table-generator - it definitively is for my own personal usage). Requiring a tedious manual XML editing step like this is an absolute no-go.

Have table-generator support that the user gives a bunch of result files, and it creates the union automatically, if this is a common use case (with parameter --union).

I have considered this already, and there exists #977. But another (likely the second most) common use case is to create a table comparing two tools or two tool configs or two tool revisions. And as soon as we want to have two or more run sets next to each other in the table, any solution that requires to pass more than one file per run set on the command line would not be practicable anymore.

Basically we have the following use cases:

table with single run set: highest priority, needs to be supported by a simple command line
table with two or more run sets: second highest priority, needs to be supported with a practically usable command line
table with single run set coming from several files (--union command-line argument for table-generator #977): low priority, so far fine to be supported only via table definitions, but command-line argument is easy to add
arbitrary tables: potentially more common than 3., but due to complexity no goal to support it via command-line arguments, only via table definitions

There could be command-line arguments to specify the level of result files, e.g., --results-for-rundefinitions and --results-for-tasksets.

This is possible.

This would avoid duplication and support both use cases. I doubt it is common that users want both.

I agree that it is likely not so common that users want both for a specific benchmark execution. However, I assume it will certainly happen that people will simply forget to pass --results-for-tasksets and when they notice afterwards they won't like that they either have to re-run the benchmarks or go editing the result files manually (and filtering result files per task set won't be a something quick to do).

So there is a real cost for 3. as well (in addition to the maintenance cost of more code paths and users having to learn to use yet another option).

Besides for some slight potential for confusion, I do not see any actual problem (such as likely danger of using wrong sets of results or inconsistencies) caused by the duplication, so the current behavior seems to have less severe disadvantages than the alternatives.

dbeyer · 2023-12-15T14:53:27Z

Yes, but I do not see that it is worth the price in this instance.

Depends on how valuable you rate my work time. I have wasted significant amounts of time over the years,
and this year again, because I cannot remember this exception to consistency and repeatedly fall into that trap.
(And to be honest, I write this issue because I do not want again.)

I would say that this principle is useful when for example duplication can cause problems such as inconsistencies etc.

Exactly, it caused extreme problems, for example, a two days delay in getting SV-COMP results reproducible.

But I do not see this here,

I hope now.

and I don't think it is a more general principle to avoid this whenever possible.

Text-book knowledge, I would say.

After all, we also produce duplicate data by for example writing out the results in different file formats (.xml, .txt) intended for different use cases.

I also do not like this! But this did not yet cause significant problems to me. (In table-generator, I have a command-line switch to get rid of that clutter. I want only HTML, not both HTML and CSV, so I can choose, and beginners get both to see what BenchExec can offer.)

But I would prefer that users would tell want they want, instead of what they do not want.

And the current duplication of run results in different kinds of XML files is similarly intended for different use cases.

Is it not! Here you talk about different formats of the same data. There you talk about the same data in the same format.

So there is a real cost for 3. as well (in addition to the maintenance cost of more code paths and users having to learn to use yet another option).

Nothing compared to loss of time due to confusion regarding data duplication, missing files, etc.

I do not see any actual problem (such as likely danger of using wrong sets of results or inconsistencies) caused by the duplication,

Why do you not see this?

dbeyer · 2023-12-15T15:10:30Z

Let me add a different perspective, because you mentioned several times that you do not see the problem.

The use cases that you mention are those of light-weight users. I am not against supporting those ---BenchExec should not only be for power users---, and I made a constructive proposal how to achieve to support those use cases as well.

But the BenchExec project was mainly motivated and developed to support "power users":

I have processed 15 million runs with BenchExec, just in the last three weeks.
Only the final, official execution of the competition SV-COMP 2024 has produced more than 18 000 XML result files.

I hope this illustration helps to understand that there is no way to find it acceptable to have extra unnecessary files with duplicated data hanging around.
The processing and analysis of those data are complex, and such a user does not want to bother with unnecessary duplicated data.

The minimal (and perhaps perfectly fine) solution would be a command-line switch to suppress the output of one kind of files, just like I can tell table-generator to omit the CSV output.

PhilippWendler · 2023-12-15T15:39:30Z

The use cases that you mention are those of light-weight users. I am not against supporting those ---BenchExec should not only be for power users---, and I made a constructive proposal how to achieve to support those use cases as well.

Thinking about power users and normal users is a good idea.

Our goal should be to make simple things easy for standard users, and complex things possible for power users.

This implies that defaults should suite standard users well, because power users are supposed to know what they need and be able to specify that.

But the BenchExec project was mainly motivated and developed to support "power users":

I do not see one or the user group as more important than the other.

Only the final, official execution of the competition SV-COMP 2024 has produced more than 18 000 XML result files.

I hope this illustration helps to understand that there is no way to find it acceptable to have extra unnecessary files with duplicated data hanging around. The processing and analysis of those data are complex, and such a user does not want to bother with unnecessary duplicated data.

Side node: I find this far from "unacceptable", this word is much too strong here, and using such exaggerations does not help. I want to put this into perspective: "unacceptable" would be stuff that makes it likely to get wrong results, or makes BenchExec unusable for users. Here we are talking about fine-tuning some details of the output to make some use cases more convenient (it is not even a feature request that would enable more use cases than possible right now). And you seem to agree, after all you labeled this "low priority" yourself. So this can't be "unacceptable" practically by definition.

And of course there is a workaround that power users can apply: delete the unnecessary files right after the benchmark execution, so that the space is wasted only for a short time.

The minimal (and perhaps perfectly fine) solution would be a command-line switch to suppress the output of one kind of files, just like I can tell table-generator to omit the CSV output.

This is a new suggestion, not brought up so far. Adding something like --suppress-merged-result-files or --only-basic-result-files (maybe with some better adjective than "basic"? "elementary"?) would not have negative effects on standard users and still provide the possibility to achieve what they want to power users, so it fits into the above goal. This is much better than the previous suggestions.

dbeyer · 2023-12-15T16:15:11Z

I do not see one or the user group as more important than the other.

I felt the need to emphasize the one user group because you were so strongly arguing in favor of the other, not acknowledging sufficiently the needs of the first.

I agree with everything until the side note.

I disagree with the side note: I did not yet succeed to convince you that this does lead to wrong results.

If a user looks at thousands of such files (there are non-expert BenchExec users looking at competition artifacts),
the user is able to detect a pattern: everything appears to be there twice, and then the user has to decide to go for either rundefinitions or tasksets. If the choice is for rundefinitions, then the user will end up with wrong results, because there are files missing for the rundefinitions.

The rundefinition files are missing in some situations (when there is only one task set in a particular rundefinition).
This is an inconsistency and thus a problem.

I am against calling my problem report an exaggeration.
The inconsistency of the result files is unacceptable. The BenchExec project has the goal to provide reliable benchmarking. This is not reliable, but fragile. The notion of reliable does not only entail reliable measurement and limits, but also reliable management of the results.

I used priority "low" because the priority is a combination of urgency and importance. It is not urgent, because for SV-COMP 2024 I have to live with the ~~mess of files~~ partially duplicated and sometimes missing result files.

And no, deleting files that BenchExec produced is not an option for me, for several reasons. (And the space is wasted on copy-on-write file systems, but disk space is not my concern.)

Also, this would lead to wrong results: If the user removes all the taskset files and keeps only the rundefinition files, then the user has effectively lost results: all those that were only in taskset results, but had the rundefinition files missing.

Regarding the naming of the command-line parameters: I suggest to use the notions that describe the result precisely: There is no notion of merging, and no notion of basic. But there is a notion of run definition and task set. Those should be used. And note that --suppress if implemented as it says leads to the wrong result: If --suppress-tast-set-results does what it says, then the now missing rundefinitions are still missing.

Last not least, I do not like the change of the title:
"Parameter for consistently named set of result XML files with no duplication" would be an appropriate name for the pull request to fix this issue.

And the parameter is not for the consistent naming, but for consistent contents.

This issue is about "Inconsistent/duplicated generation of XML result files", which is not yet addressed, but discussed.
Let us please revert the title.

dbeyer · 2023-12-15T16:36:29Z

Regarding deleting: I would have to delete 2373 result XML files for rundefinitions.

I am tempted to remove them, in order to provide a clean artifact to users.
But (a) who guarantees that I am not removing too much, and (b) I consider it important
to keep all XML files that somebody would obtain using the same BenchExec command that I used to produce them.
All result XMLs that would be reproduced, should be in the artifact as well.

dbeyer · 2023-12-15T20:38:18Z

For the route using the --suppress (my proposal 4), it would require two steps:
(a) add code for writing the now missing files for run definitions, in order to consistently always have those result files, and
(b) add the command-line switches to turn off the output of one of the two kinds.
This would not cause many "more code paths" that would be difficult to maintain.

I think this is fairly simple to implement. I can create a pull request if necessary.

dbeyer · 2023-12-15T20:46:58Z

Orthogonally to everything said so far: The use case for the (light-weight) user that you mentioned Philipp "benchexec produces one results file, and this can be given to table-generator without further action by the user" is currently not supported. The user gets several XML results files (one for each run definition, one for each task set), but not for all results. So there should be three levels of results files for such users:

benchmark
run definition
task set

Just an observation. I have no strong opinion on this.

PhilippWendler · 2023-12-18T06:01:15Z

For the route using the --suppress (my proposal 4), it would require two steps: (a) add code for writing the now missing files for run definitions, in order to consistently always have those result files, and (b) add the command-line switches to turn off the output of one of the two kinds. This would not cause many "more code paths" that would be difficult to maintain.

Not sure if I understand it correctly, but the files that you refer to as "missing files" should not be added in default mode, because their content would be exactly the same as the already existing files, so pure redundancy. The switch would (a) enable all task-definition result files and (b) disable all run-definition result files.

I think this is fairly simple to implement. I can create a pull request if necessary.

This is of course always welcome!

Orthogonally to everything said so far: The use case for the (light-weight) user that you mentioned Philipp "benchexec produces one results file, and this can be given to table-generator without further action by the user" is currently not supported. The user gets several XML results files (one for each run definition, one for each task set),

Yes, but benchexec prints a command-line suggestion for how to call table-generator in the end, and that suggestion contains the file name that these users need, so they just need to copy and paste it, they do not need to worry what the other files are as long as they don't need them.

but not for all results. So there should be three levels of results files for such users:
* benchmark
* run definition
* task set

If you have more than one run definition in a benchmark definition, then the primary use case for this is to benchmark different configurations of the same tool. So when creating tables, these results per run definition should be put next to each other, and for that we exactly have the required file output. And the above mentioned command-line suggestion from benchexec covers this case prints a command line that will produce the desired table.

The fact that SV-COMP uses different run definitions for different sets of tasks instead of different configuration is unfortunately currently necessary for technical reasons (#8), but not the common case.

Btw., we also do not have a common output file with all results for the case where you give more than one benchmark definition to benchexec. This is also fine.

dbeyer · 2023-12-18T06:40:06Z

Proposal:

Fix the default behavior to output only (and all!) results for run-definitions.
This covers what you see as the most common use case.
File names: ....<name of run definition>.xml.bz2
Add parameter --results-per-taskset that changes the default behavior to output results
for task-set results (and only those).
I would not call it suppress or so, because it is nice to tell what the user wants, not what is not wanted.
File names: ....<name of run definition>.<name of task set>.xml.bz2

PhilippWendler · 2023-12-18T07:06:19Z

I have explained in #983 (comment) already that I do not see this as a good solution: Users will inevitably forget to pass --results-per-taskset at some point, and then they are stuck with either having to re-execute everything or manually fiddle with the result files and extract result sets per task set (which is not trivial).

PhilippWendler · 2023-12-18T15:58:40Z

Result of offline discussion:

We make the set of produced result XML files always consisting by changing the default to always producing all result files (on run-definition and task-set level), even for the case where it is currently missing (1 task set per run definition) and some output files will be identical.
We provide command-line parameters for choosing whether a user wants only result XML files on the run-definition or task-set level.

A use case for the latter would for example also be CPAchecker's CI, where we only want result files on the run-definition level.

The name of the parameter(s) is still open, but as usual we want a positive term (no negation) and prefer to avoid abbreviations.

dbeyer added low priority usability labels Dec 15, 2023

PhilippWendler changed the title ~~Inconsistent/duplicated generation of XML result files~~ Parameter for consistently named set of result XML files with no duplication Dec 15, 2023

PhilippWendler changed the title ~~Parameter for consistently named set of result XML files with no duplication~~ Make set of result XML files always consistent and provide parameter for selection Dec 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make set of result XML files always consistent and provide parameter for selection #983

Make set of result XML files always consistent and provide parameter for selection #983

dbeyer commented Dec 15, 2023

PhilippWendler commented Dec 15, 2023

dbeyer commented Dec 15, 2023 •

edited

Loading

PhilippWendler commented Dec 15, 2023 •

edited

Loading

dbeyer commented Dec 15, 2023 •

edited

Loading

dbeyer commented Dec 15, 2023 •

edited

Loading

PhilippWendler commented Dec 15, 2023 •

edited

Loading

dbeyer commented Dec 15, 2023 •

edited

Loading

dbeyer commented Dec 15, 2023

dbeyer commented Dec 15, 2023

dbeyer commented Dec 15, 2023

PhilippWendler commented Dec 18, 2023

dbeyer commented Dec 18, 2023

PhilippWendler commented Dec 18, 2023

PhilippWendler commented Dec 18, 2023

Make set of result XML files always consistent and provide parameter for selection #983

Make set of result XML files always consistent and provide parameter for selection #983

Comments

dbeyer commented Dec 15, 2023

PhilippWendler commented Dec 15, 2023

dbeyer commented Dec 15, 2023 • edited Loading

PhilippWendler commented Dec 15, 2023 • edited Loading

dbeyer commented Dec 15, 2023 • edited Loading

dbeyer commented Dec 15, 2023 • edited Loading

PhilippWendler commented Dec 15, 2023 • edited Loading

dbeyer commented Dec 15, 2023 • edited Loading

dbeyer commented Dec 15, 2023

dbeyer commented Dec 15, 2023

dbeyer commented Dec 15, 2023

PhilippWendler commented Dec 18, 2023

dbeyer commented Dec 18, 2023

PhilippWendler commented Dec 18, 2023

PhilippWendler commented Dec 18, 2023

dbeyer commented Dec 15, 2023 •

edited

Loading

PhilippWendler commented Dec 15, 2023 •

edited

Loading

dbeyer commented Dec 15, 2023 •

edited

Loading

dbeyer commented Dec 15, 2023 •

edited

Loading

PhilippWendler commented Dec 15, 2023 •

edited

Loading

dbeyer commented Dec 15, 2023 •

edited

Loading