
Question about human ratings #178

Open
jbdel opened this issue Mar 15, 2024 · 0 comments

Comments


jbdel commented Mar 15, 2024

Hello,
Do you happen to still have the COCO human ratings? I would just like to see the setup. Were M1 and M2 binary, or were they collected on a more fine-grained scale?

What is the setup for reporting the Pearson correlation? Is the following correct:

from scipy.stats import pearsonr

# Example BLEU scores for different systems (averaged per system over the corpus, I suppose?)
bleu_scores = [0.4, 0.45, 0.5, 0.55, 0.6]
# Corresponding human judgment scores per system (could be M1, M2, or any relevant metric),
# e.g. M1: percentage of captions judged better than or equal to human captions
human_judgment_scores = [0.8, 0.75, 0.9, 0.85, 0.95]
pearson_correlation, p_value = pearsonr(bleu_scores, human_judgment_scores)
print(pearson_correlation, p_value)
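For context, here is a minimal sketch of the per-system averaging I have in mind; the system names and per-image scores below are made up for illustration, and it assumes one BLEU score and one human rating per image per system:

from scipy.stats import pearsonr

# Hypothetical per-image BLEU scores and human ratings, keyed by system name
# (all values below are invented for illustration).
per_image_bleu = {
    "system_a": [0.38, 0.42, 0.40],
    "system_b": [0.44, 0.46, 0.45],
    "system_c": [0.50, 0.52, 0.48],
}
per_image_human = {
    "system_a": [0.78, 0.80, 0.82],
    "system_b": [0.74, 0.76, 0.75],
    "system_c": [0.90, 0.88, 0.92],
}

systems = sorted(per_image_bleu)
# Average each metric per system over the corpus, then correlate at the system level.
bleu_avg = [sum(per_image_bleu[s]) / len(per_image_bleu[s]) for s in systems]
human_avg = [sum(per_image_human[s]) / len(per_image_human[s]) for s in systems]

r, p = pearsonr(bleu_avg, human_avg)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")

Is this system-level averaging what was done in the original evaluation, or was the correlation computed some other way?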

Thank you,
JB
