pass@k for Bash should be higher #57

PootieT · 2023-04-21T13:51:08Z


Failure Case 1

I'm not sure whether this is an expected failure mode of the unit tests in Bash. Here is an example, HumanEval_45_triangle_area:

#!/bin/bash
#
#
# $1 is an integer
# $2 is an integer
triangle_area() {
    echo "$1 * $2 / 2.0" | bc -l
}


candidate() {
    triangle_area "$@"
}

set -e
run_test() {
    [[ $(candidate "5" "3") = "7.5" ]]
    [[ $(candidate "2" "2") = "2.0" ]]
    [[ $(candidate "10" "8") = "40.0" ]]
}

run_test

If we print the output of $(candidate "5" "3"), it is "7.500000000", which differs from the expected "7.5", so the test fails. Maybe the tests could use something like bc to evaluate the numeric value of the strings instead of comparing them as strings?

Failure Case 2

HumanEval_42_incr_list

#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
    for e in $1; do
        echo $((e + 1))
    done
}


candidate() {
    incr_list "$@"
}

set -e
run_test() {
#    [[ $(candidate "") = "" ]]
    echo $(candidate "3 2 1")   # prints -> 4 3 2\n
#    [[ $(candidate "3 2 1") = "4 3 2" ]]
    echo $(candidate "5 2 5 2 3 3 9 0 123")  # prints -> 6 3 6 3 4 4 10 1 124\n
    [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124" ]]
}

run_test

The first test passes; the second and third tests fail, so I printed the output of each case.

I tried adding the newline character \n to the end of the expected values, and that didn't work. I don't know enough Bash to see how this might be fixed, but I don't think this should fail?

arjunguha (Maintainer) · 2023-04-21T16:16:45Z

Tagging @mgree

Regarding Case 1: I'm not even sure what the right thing to do here is! What you get will depend on what tools the generated script will shell out to.

For example, here is another solution:

triangle_area() {
  python3 -c "print(5 * 3 / 2)"
}

This produces "7.5\n".

arjunguha (Maintainer) · 2023-04-21T16:29:50Z

About Case 2: Here is my hand-written fix. I've edited both the solution and the tests:

#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
    for e in $1; do
        echo -n "$((e + 1)) "
    done
}


candidate() {
    incr_list "$@"
}

run_test() {
    [[ $(candidate "") = "" ]]
    echo $?
    [[ $(candidate "3 2 1") = "4 3 2 " ]]
    echo $?
    [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124 " ]]
    echo $?
}

run_test

This produces:

$ bash incrlist.sh 
0
0
0

I am not sure it is reasonable to prompt a model to produce this solution. I also think it's worse than the model-generated solution. I think exact-matching on Bash results is a losing proposition. What we should do instead is something fuzzier, but that will be tricky to automate in the MultiPL-E style.
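One possible shape for a fuzzier check is to compare outputs word by word and ignore how the words are separated. The sketch below is only an assumption about what such a check could look like, not something MultiPL-E implements; `normalize_ws` and `same_words` are hypothetical helper names. It accepts both the newline-separated output of the model-generated solution and the trailing-space output of the hand-written fix.

```bash
#!/bin/bash
# Hypothetical helpers (not part of MultiPL-E).

# Collapse every run of whitespace (spaces, tabs, newlines) into a single
# space, then strip one leading and one trailing space if present.
normalize_ws() {
    printf '%s' "$1" | tr -s '[:space:]' ' ' | sed -e 's/^ //' -e 's/ $//'
}

# Succeed when both arguments contain the same whitespace-separated words.
same_words() {
    [ "$(normalize_ws "$1")" = "$(normalize_ws "$2")" ]
}

newline_separated=$(printf '4\n3\n2\n')   # shape of the model-generated output
trailing_space="4 3 2 "                   # shape of the hand-written fix

same_words "$newline_separated" "4 3 2" && echo "newline-separated output accepted"
same_words "$trailing_space" "4 3 2" && echo "trailing-space output accepted"
same_words "4 3 3" "4 3 2" || echo "different numbers rejected"
```

A check like this is deliberately lenient: it would also accept output with tabs or blank lines between the numbers, which may or may not be desirable for a given problem.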

arjunguha (Maintainer) · 2023-04-21T16:38:58Z

Also, I have tons of solutions, from all sorts of models, where the model produces Python.

PootieT (Author) · 2023-04-21T16:46:52Z

I see. Yeah, I don't see an obvious solution to these problems; perhaps steering away from Bash is the best option. Thanks!

arjunguha (Maintainer) · 2023-04-21T16:54:18Z

I'm going to leave this issue open as a warning about interpreting the Bash results.

mgree (Collaborator) · 2023-04-22T15:29:08Z

I think it would not be too hard to write a tester that uses, e.g., bc to check floating point arithmetic in shell scripts:

$ printf "0.5 == 1 / 2\n0.5 == 1/3" | bc -l
1
0

It would be up to the model to generate code that would produce strings that would be correctly interpreted as floating point numbers, though.
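As a rough illustration of that idea, the triangle_area tests from Failure Case 1 could call a small comparison helper instead of matching strings. This is a minimal sketch under assumptions: `float_eq` is a hypothetical name, it relies on bc being installed, and it checks exact numeric equality with no tolerance.

```bash
#!/bin/bash
# Hypothetical helper (not part of MultiPL-E): succeeds when bc considers
# the two arguments numerically equal. bc prints 1 for a true relation
# and 0 for a false one.
float_eq() {
    [ "$(echo "$1 == $2" | bc -l)" = "1" ]
}

# The model-generated solution from Failure Case 1.
triangle_area() {
    echo "$1 * $2 / 2.0" | bc -l
}

# Same expected values as the original tests, compared as numbers.
run_test() {
    float_eq "$(triangle_area 5 3)" "7.5" &&
    float_eq "$(triangle_area 2 2)" "2.0" &&
    float_eq "$(triangle_area 10 8)" "40.0"
}

run_test && echo "all tests passed"
```

If the candidate prints something bc cannot parse as a number, bc reports an error instead of printing 1, so the helper fails rather than passing by accident.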

mgree (Collaborator) · 2023-04-22T15:32:44Z

Failure Case 2

HumanEval_42_incr_list

#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
    for e in $1; do
        echo $((e + 1))
    done
}

This loop outputs a newline after each number. Maybe use printf "%d " $((e + 1)), with a printf "\n" at the end? (echo -n isn't portable.)
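A sketch of a printf-based version follows. Taken literally, the suggestion above would leave a trailing space ("4 3 2 "), which the original expected strings do not contain, so this variation prints the separator before every element except the first. The `sep` variable is an addition made for illustration.

```bash
#!/bin/bash
# $1 is a space-separated list
incr_list() {
    local sep=""
    for e in $1; do
        # Print the separator before every element except the first,
        # so no trailing space is emitted.
        printf '%s%d' "$sep" "$((e + 1))"
        sep=" "
    done
    # Single final newline; command substitution strips it.
    printf '\n'
}

[ "$(incr_list "")" = "" ] && echo "empty list ok"
[ "$(incr_list "3 2 1")" = "4 3 2" ] && echo "short list ok"
[ "$(incr_list "5 2 5 2 3 3 9 0 123")" = "6 3 6 3 4 4 10 1 124" ] && echo "long list ok"
```

With this version the three original tests pass unchanged, although it is still a hand-written solution rather than something a model was prompted to produce.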

candidate() {
    incr_list "$@"
}

set -e
run_test() {
#    [[ $(candidate "") = "" ]]
    echo $(candidate "3 2 1")   # prints -> 4 3 2\n
#    [[ $(candidate "3 2 1") = "4 3 2" ]]
    echo $(candidate "5 2 5 2 3 3 9 0 123")  # prints -> 6 3 6 3 4 4 10 1 124\n
    [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124" ]]
}

run_test

The first test passes; the second and third tests fail, so I printed the output of each case.

I tried adding the newline character \n to the end of the expected values, and that didn't work. I don't know enough Bash to see how this might be fixed, but I don't think this should fail?

The test is failing because of the added newlines. The command substitution $(candidate "5 2 5 ...") will trim the final newline, but not the ones in the middle.
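The short demonstration below shows that behavior. It also suggests why the echo lines in the test script printed everything on one line: the command substitution there is unquoted, so the shell splits its result into words and echo joins them with single spaces. The variable names `raw` and `flat` are chosen for illustration.

```bash
#!/bin/bash
# The model-generated solution: one number per line.
incr_list() {
    for e in $1; do
        echo $((e + 1))
    done
}

# Command substitution strips the trailing newline but keeps the inner ones,
# so raw holds three lines: 4, 3 and 2.
raw=$(incr_list "3 2 1")

# Unquoted expansion is split on whitespace (including newlines), and echo
# rejoins the words with single spaces. Unquoted expansion also performs
# filename globbing, which is harmless here because the words are digits.
flat=$(echo $raw)

[ "$raw" = "4 3 2" ] || echo "raw output does not match"
[ "$flat" = "4 3 2" ] && echo "flattened output matches"
```

So the string compared inside [[ ... ]] still contains newlines, even though the echoed output looks like the expected value.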
