[Discussion] Define a standard data subset for "micro" challenge #115

Moelf opened this issue Apr 7, 2023 · 3 comments
Labels
analysis task concerns analysis task

Comments


Moelf commented Apr 7, 2023

The idea is to define how to reduce the total events processed so people can compare results in microbenchmark settings.

1) pick first N files per process

This reduces the number of files the workload touches. All the files are similarly sized nanoAODs, but I worry that different processes may have very different total numbers of files. Another problem is that reducing the number of files may change how some people prefer to parallelize the whole workload.

2) pick first N events per file

This touches every file, like the full version, while effectively pretending each file is smaller than it really is. Because all files are still hit, it's fairer with respect to filesystem/network considerations, and because nanoAODs are similarly sized, this should be reasonable. One problem is that opening many, many files is itself a bottleneck, and this approach does not reduce the time spent there.

3,4) pick the first X% of files per process / events per file

Similar to 1 and 2, except the cut is a percentage. This should be fairer and give higher fidelity in terms of the physics result, but it's annoying to implement...
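
To make the four options concrete, here is a minimal sketch in Python. It assumes a fileset is a plain dict mapping process name to an ordered list of file paths; the function names and the file names in the usage note are hypothetical and not part of any existing AGC implementation.

```python
from math import ceil


def first_n_files(fileset, n):
    """Option 1: keep the first n files of every process."""
    return {proc: files[:n] for proc, files in fileset.items()}


def first_fraction_of_files(fileset, fraction):
    """Option 3: keep the first `fraction` of files per process (at least one)."""
    return {
        proc: files[: max(1, ceil(fraction * len(files)))]
        for proc, files in fileset.items()
    }


def event_range(n_events_in_file, n=None, fraction=None):
    """Options 2 and 4: return the (start, stop) entry range to read from one file.

    Pass `n` for a fixed number of events or `fraction` for a percentage cut.
    With neither argument, the whole file is read.
    """
    if n is not None:
        stop = min(n, n_events_in_file)
    elif fraction is not None:
        stop = min(n_events_in_file, ceil(fraction * n_events_in_file))
    else:
        stop = n_events_in_file
    return 0, stop
```

For example, with a hypothetical fileset of 10 `ttbar` files and 3 `wjets` files, `first_n_files(fileset, 5)` keeps 5 and 3 files respectively, which illustrates the imbalance concern in option 1. The range returned by `event_range` would then be handed to whatever reader the implementation uses.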


Moelf commented Apr 7, 2023

In the long run we might want to make both horizontal and vertical slicing possible, something like:

| step \ data size | 10% | 50% | 100% |
|---|---|---|---|
| ML | | | |
| Histogramming | | | |
| Systematics Histogramming | | | |
| Likelihood fit | | | |
| Sum | | | |

alexander-held added the analysis task label May 1, 2023
@eguiraud
Contributor

Hi, what's the goal for the micro challenge? I have run smaller-scale versions of the AGC only as a test bed with faster iteration time, using method 1).


Moelf commented May 17, 2023

I think there are two goals:

A) help identify different bottlenecks

Ideally, "file" -> "histograms" should scale linearly up to very high core/node counts. If someone's implementation doesn't show this, or shows it only with 1) but not with 2), it's a sign that something is inefficient.

B) help level the benchmarking playing field

It's much easier to manage "1 beefy machine with stable I/O bandwidth and latency" than to find "a cluster with stable occupancy". This way we can benchmark different setups without reserving a cluster and granting access to every soul out there who wants to benchmark the AGC.
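
For goal A, one way to quantify "scales linearly" is to compute speedup and parallel efficiency from wall times measured at several core counts. The sketch below is only an illustration: the function name `scaling_report`, the 0.8 efficiency threshold, and the timings in the usage note are assumptions, not measurements from any AGC run.

```python
def scaling_report(timings, threshold=0.8):
    """Summarize how well a workload scales with core count.

    timings: dict mapping core count -> wall time in seconds.
    Returns a list of (cores, speedup, efficiency, ok) tuples sorted by
    core count, where speedup is relative to the smallest core count and
    ok is False when efficiency falls below `threshold` (an assumed cutoff).
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    report = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        # Perfectly linear scaling gives an efficiency of 1.0.
        efficiency = speedup / (cores / base_cores)
        report.append((cores, speedup, efficiency, efficiency >= threshold))
    return report
```

With made-up timings such as `{1: 100.0, 2: 50.0, 4: 40.0}`, the 4-core entry has a speedup of 2.5 and an efficiency of 0.625, so it would be flagged. Running the same report on subsets built with method 1) and method 2) would show whether the scaling breaks only when every file has to be opened.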
