Hello,
Do you happen to still have the COCO human ratings, just so I can see the setup? Were M1 and M2 binary, or were they on a more fine-grained scale?
What is the setup for reporting the Pearson correlation? Is the following correct:
from scipy.stats import pearsonr

# Example BLEU scores for different systems (averaged per system over the corpus, I suppose?)
bleu_scores = [0.4, 0.45, 0.5, 0.55, 0.6]

# Corresponding human judgment scores per system, e.g. M1
# (percentage of captions judged better than or equal to human captions).
human_judgment_scores = [0.8, 0.75, 0.9, 0.85, 0.95]

pearson_correlation, p_value = pearsonr(bleu_scores, human_judgment_scores)
print(pearson_correlation, p_value)
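To make my assumption about "averaged per system" explicit, here is a minimal sketch with made-up per-caption scores for three hypothetical systems (the variable names and numbers are mine, not from the repo); this is just my guess at the setup, not necessarily the paper's protocol:

from statistics import mean
from scipy.stats import pearsonr

# Hypothetical per-caption scores for three systems (made-up numbers,
# only to illustrate the per-system averaging I have in mind).
per_caption_bleu = {
    "system_A": [0.35, 0.42, 0.48],
    "system_B": [0.50, 0.47, 0.53],
    "system_C": [0.61, 0.58, 0.60],
}
per_caption_human = {
    "system_A": [0.70, 0.80, 0.75],
    "system_B": [0.85, 0.80, 0.90],
    "system_C": [0.95, 0.90, 0.92],
}

systems = sorted(per_caption_bleu)
# One averaged score per system over the whole corpus.
avg_bleu = [mean(per_caption_bleu[s]) for s in systems]
avg_human = [mean(per_caption_human[s]) for s in systems]

# System-level Pearson correlation between the metric and human judgments.
r, p = pearsonr(avg_bleu, avg_human)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")

Is this system-level averaging what was done in the paper, or was the correlation computed at some other granularity?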
Thank you,
JB