[Discussion] Define a standard data subset for "micro" challenge #115

Moelf · 2023-04-07T18:06:37Z

The idea is to define how to reduce the total number of events processed, so that people can compare results in microbenchmark settings.

1) pick first N files per process

This would reduce the number of files the workload touches. Given that all the files are similarly sized nanoAODs, I worry that different processes may have very different total numbers of files, so a fixed N would cover a very different share of each process. Another problem is that reducing the number of files may change how some people prefer to parallelize the whole workload.

2) pick first N events per file

This touches every file, as the full version does, effectively pretending each file is smaller than it really is, so it's fairer with respect to specific FS/network considerations. Again, because nanoAODs are similarly sized, this should be reasonable. One problem with this approach is that opening a very large number of files is itself a bottleneck, and this approach does not reduce the time spent there.

3,4) pick the first some% of files / events from each process / file

Similar to 1 and 2, except done as a percentage. This should be fairer and give higher fidelity in terms of the physics result, but it is annoying to implement...

Moelf · 2023-04-07T18:14:33Z

In the long run, we might eventually want to make both horizontal and vertical slicing possible, something like:

| step \ data size | 10% | 50% | 100% |
| --- | --- | --- | --- |
| ML | | | |
| Histogramming | | | |
| Systematics Histogramming | | | |
| Likelihood fit | | | |
| Sum | | | |
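A table like this could be filled by a small timing harness. The sketch below is only an assumption about how that might look: `time_steps` and the stand-in step function are hypothetical, and each step is taken to be a callable that accepts the data fraction it should process.

```python
import time


def time_steps(steps, fractions):
    """Time each step at each data fraction.

    steps: {step_name: callable(fraction)}
    Returns {step_name: {fraction: seconds}} plus a "Sum" row.
    """
    table = {name: {} for name in steps}
    table["Sum"] = {f: 0.0 for f in fractions}
    for name, run in steps.items():
        for f in fractions:
            start = time.perf_counter()
            run(f)
            elapsed = time.perf_counter() - start
            table[name][f] = elapsed
            table["Sum"][f] += elapsed
    return table


# Stand-in workload, for illustration only: the cost grows with the fraction.
def fake_histogramming(fraction):
    return sum(range(int(100_000 * fraction)))


table = time_steps({"Histogramming": fake_histogramming}, [0.1, 0.5, 1.0])
for step, row in table.items():
    print(step, {f: round(t, 6) for f, t in row.items()})
```

Each row of the result corresponds to a row of the table (vertical slicing by step) and each key inside a row to a column (horizontal slicing by data size).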

eguiraud · 2023-05-16T21:55:40Z

Hi, what's the goal for the micro challenge? I have run smaller-scale versions of the AGC only as a test bed with faster iteration time, using method 1).

Moelf · 2023-05-17T14:48:16Z

I think there are two goals

A) help identify different bottlenecks

Ideally, "file" -> "histograms" should scale linearly up to very high core/node counts. If someone's implementation doesn't show this, or shows it only in 1) but not in 2), it's a sign that something is inefficient.

B) help level the benchmarking ground

It's much easier to manage "1 beefy machine with stable I/O bandwidth and latency" than to find "a cluster with stable occupancy". This way we can benchmark different setups without reserving a cluster and granting access to every soul out there who wants to benchmark AGC.
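For goal A, one way to check "scales linearly" is to compute parallel efficiency from wall-clock timings taken at several core counts. This is a sketch under assumptions: the function names are hypothetical, the timings are made-up numbers, and the 0.8 threshold is an arbitrary choice rather than anything agreed in this thread.

```python
def parallel_efficiency(timings):
    """timings: {n_cores: wall_seconds}.

    Efficiency is measured relative to the smallest core count:
    1.0 means perfectly linear scaling, lower means lost time.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    return {n: (base_time * base_cores) / (t * n) for n, t in timings.items()}


def first_drop(efficiency, threshold=0.8):
    """Smallest core count whose efficiency falls below threshold, else None."""
    for n in sorted(efficiency):
        if efficiency[n] < threshold:
            return n
    return None


# Made-up timings, for illustration only.
timings = {1: 100.0, 2: 50.0, 4: 40.0}
eff = parallel_efficiency(timings)
print(eff)
print(first_drop(eff))
```

Running the same check on subsets built with option 1) and with option 2) would show whether the scaling breaks only when every file has to be opened.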

alexander-held added the "analysis task" label (concerns analysis task) · May 1, 2023
