-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge processes at runtime #267
base: main
Are you sure you want to change the base?
Conversation
Merge from original branch
Hi, are there any updates regarding this PR? I think it might be related to issue #291 |
Thanks for this! I had a few inline comments to the code, another couple things here: Firstly, perhaps it is possible to simplify the code by packaging the loop over processes (or merged processes) into a function which is reused. Since now we have two loops over prefit (processes, then merged processes) and two loops over postfit (same as before). They are handling slightly different objects and with slightly different treatment for pre- and postfit-, so maybe its not so simple; but much of the code is replicated and it might be nice to compactify this. Secondly, I think we would also want to add the merged processes into the loop over processes when the postfit/prefit yields are printed, right? https://github.com/cms-analysis/CombineHarvester/pull/267/files#diff-850d8a5178aa5366c05bacce2ec563fa72bfa4ce6ef5509f33b134d0dae89e5eL442 |
TH1F rand_shape = this->GetShapeInternal(lookup); | ||
for (int j = 1; j <= rand_shape.GetNbinsX(); ++j) { | ||
tmp.Form("\tBin Content %i: %f\n", j, rand_shape.GetBinContent(j)); | ||
// std::cout << tmp.Data(); | ||
full_rand_shape_summary.at(j-1) = rand_shape_summary; | ||
full_rand_shape_summary.at(j-1) += tmp.Data(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like quite a lot of information (checking the updated shape for bin for every updated parameter value). I wonder if it is maybe a bit too much, and could be avoided? But perhaps this was useful in some debugging which I'm simply not aware of. It just seems it would be a lot to even read through for large models, and perhaps targeting more specific information might be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes definitely! I originally implemented this when we generated post-fit uncertainty bands for our ttHbb analysis and so some very large fluctuations in the bands. The intend there was really to find the reason for these fluctuations, which is why I wanted to output as much information as possible.
I think the comment by Andrew is valid, and it might be better to simply change to toys themselves into a dedicated root file. I can also update this PR accordingly if we want to have something like that from the start, or we can do it at some later point in time
std::fabs(rand_shape.GetBinContent(i) - shape.GetBinContent(i)); | ||
shape.SetBinError(i, err*err + shape.GetBinError(i)); | ||
std::fabs(rand_shape.GetBinContent(j) - shape.GetBinContent(j)); | ||
if (std::fabs(err/shape.GetBinContent(j)) >= 0.5 and verbose){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be over optimizing here, but perhaps some of these extra steps could be avoided by default when not using the verbose mode? Perhaps just an if( verbose)
before this loop, and in other places, for example?
I'm also wondering if it would be useful to make this 0.5 threshold a configurable parameter?
post_shapes[bin][proc] = cmb_proc.GetShape(); | ||
} else { | ||
post_shapes[bin][proc] = | ||
sampling ? cmb_proc.GetShapeWithUncertainty(res, samples, verbose) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deafult behaviour of PostFitShapesFromWorkspace should be to use sampling, and now use of the actual --sampling
flag prints out a warning. So the code should not check the --sampling
flag, but rather the --no-sampling
flag, and always do sampling unless --no-sampling
has been set.
Hi, I remembered I made some related changes a while ago, and never merged them. See #295 which I just opened. That adds the option to merge channels (not processes), so is in principle complementary. I tend to agree with @kcormi that the changes to |
Hi, thanks for the review and the comments! Indeed, I did most/some of these changes in the meanwhile in my local setup - I'm in the process of getting my local repository in shape to update this PR asap. Regarding the additional debugging information: Yes, totally agree that this should be completely optional (which it is in my local version of the Harvester at this point). @ajgilbert , your comment about saving the toys themselves instead of a very long string is excellent I think - I'll have a look at this as soon as this PR is updated. |
Hi! I am finally done with the update for this PR. In particular, I
Unfortunately, the PR is now a little messy since I changed quite a few things in one go (sorry). Let me know if you'd like me to change something more, sorry for the mess! |
…threshold parameter to report in verbose mode (not excessible right now)
Thanks, this now has a lot of changes :-) |
Conflicts: CombineTools/bin/PostFitShapesFromWorkspace.cpp
…merge_processes_inline
…merge_processes_inline Conflicts: CombineTools/src/CombineHarvester_Evaluate.cc
Hi @anigamova , as discussed earlier, I've updated this PR. I have cherry-picked the updates from #295 and included the latest changes from main. Therefore, this PR should now enable to merge analysis bins and/or processes at runtime :) Let me know if you have any comments! |
Hi @pkausw, It would be great to do what @ajgilbert suggested in PR #295, i.e. try to call GetShapeWithUncertainty method once, so that we don't generate 500 toys for each histogram but rather save the list of all of the shapes we want to produce and pass it to the GetShapeWithUncertainty function at the end and loop over the shapes for each toy once. I will also leave few comments where I think it would make sense to implement these changes with your PR. |
cmb.GetShape(); | ||
} else { | ||
shape_container[proc] = | ||
no_sampling ? cmb.GetShapeWithUncertainty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here to avoid sampling for each histogram you could call another method to update the list/map of all of shapes.
// std::cout << tmp.Data(); | ||
rand_shape_summary = tmp.Data(); | ||
// std::cout << rand_shape_summary.c_str() << std::endl; | ||
TH1F rand_shape = this->GetShapeInternal(lookup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the place where the loop through the list/map of shapes would go then
Suppose you have a datacard with a process class with some sub-processes. For example, in our analysis we have a data-driven control region (CR) that is derived as the difference between
data_CR
and some other processesproc1_CR
andproc2_CR
. Since the complete process in the CR is estimated during the fit, the processes need to be considered individually in the datacard for a correct parameterization. However, when interpreting the results and plotting everything, we're only interested in the total process in the CR and would like to correctly account for the post-fit correlations between the individual*_CR
processes in order to derive sensible uncertainty bands.This PR introduces an option to
PostFitShapesFromWorkspace
to merge processes at runtime. The processes can be defined with the syntaxNAME='expr'
, whereexpr
is the regular expression that is parsed to the process filter of the harvester instance.Additionally, the PR introduces an option for additional printout during the calculation of the post-fit uncertainty bands. If activated, the option will print out the difference between the bin content of the nominal post-fit template and each toy used for the construction of the uncertainty band, for each bin respectively. This functionality can be used to understand abnormally large fluctuations in the uncertainty band and is intended for debugging purposes.