
Question about human ratings #178

Open
jbdel opened this issue Mar 15, 2024 · 0 comments

Comments


jbdel commented Mar 15, 2024

Hello,
Do you happen to still have the COCO human ratings? I would just like to see the setup. Were M1 and M2 binary, or were they collected on a more fine-grained scale?

What is the setup for reporting the Pearson correlation? Is the following correct:

from scipy.stats import pearsonr

# Example BLEU scores for different systems (averaged per system over the corpus, I suppose?)
bleu_scores = [0.4, 0.45, 0.5, 0.55, 0.6]
# Corresponding human judgment scores per system (could be M1, M2, or any relevant metric),
# e.g. M1: percentage of captions judged better than or equal to human captions
human_judgment_scores = [0.8, 0.75, 0.9, 0.85, 0.95]
pearson_correlation, p_value = pearsonr(bleu_scores, human_judgment_scores)
print(pearson_correlation, p_value)
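For context, here is a minimal sketch of the per-system averaging I have in mind; the system names and per-image scores below are made up for illustration, and it assumes one BLEU score and one human rating per image per system:

from scipy.stats import pearsonr

# Hypothetical per-image BLEU scores and human ratings, keyed by system name
# (all values below are invented for illustration).
per_image_bleu = {
    "system_a": [0.38, 0.42, 0.40],
    "system_b": [0.44, 0.46, 0.45],
    "system_c": [0.50, 0.52, 0.48],
}
per_image_human = {
    "system_a": [0.78, 0.80, 0.82],
    "system_b": [0.74, 0.76, 0.75],
    "system_c": [0.90, 0.88, 0.92],
}

systems = sorted(per_image_bleu)
# Average each metric per system over the corpus, then correlate at the system level.
bleu_avg = [sum(per_image_bleu[s]) / len(per_image_bleu[s]) for s in systems]
human_avg = [sum(per_image_human[s]) / len(per_image_human[s]) for s in systems]

r, p = pearsonr(bleu_avg, human_avg)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")

Is this system-level averaging what was done in the original evaluation, or was the correlation computed some other way?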

Thank you,
JB
