metric for enformer #9
Eli:
Let's assume the data are as follows:
batch1: input1, target1
batch2: input2, target2
batch3: input3, target3
The idea behind calculating the Pearson correlation coefficient in the original TensorFlow version of Enformer is as follows:
cor(c(input1,input2,input3),c(target1,target2,target3))
The idea behind calculating it in the PyTorch version of Enformer is as follows:
mean(cor(input1,target1),cor(input2,target2),cor(input3,target3))
I think the second option is reasonable.
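For concreteness, here is a minimal numpy sketch of the two options on synthetic data; the names (`pearson`, `preds`, `targets`) and shapes are illustrative, not taken from either codebase:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    x, y = x - x.mean(), y - y.mean()
    return (x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum())

rng = np.random.default_rng(0)
preds = [rng.normal(size=128) for _ in range(3)]                # input1..input3
targets = [p + rng.normal(scale=0.5, size=128) for p in preds]  # target1..target3

# Option 1 (TensorFlow Enformer): one global correlation over concatenated batches
global_r = pearson(np.concatenate(preds), np.concatenate(targets))

# Option 2 (PyTorch port): correlate each batch separately, then average
per_batch_r = np.mean([pearson(p, t) for p, t in zip(preds, targets)])

print(global_r, per_batch_r)  # generally not equal
```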
Reply:
Why is the second better? Here each input and target is a different batch along the sequence axis. Taking the global correlation of all points is a correlation metric, while the mean of subsequence correlations (with arbitrary cut points, even) is something else that needs more justification, in my opinion.
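The cut-point dependence is easy to see numerically. In this sketch (synthetic data; `mean_chunk_pearson` is a hypothetical helper, not from the repo), re-slicing the same sequence changes option 2 but not option 1:

```python
import numpy as np

def pearson(x, y):
    x, y = x - x.mean(), y - y.mean()
    return (x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum())

def mean_chunk_pearson(pred, target, cuts):
    """Mean of per-chunk correlations for one choice of cut points."""
    return np.mean([pearson(p, t) for p, t in
                    zip(np.split(pred, cuts), np.split(target, cuts))])

rng = np.random.default_rng(1)
pred = rng.normal(size=300)
target = pred + rng.normal(size=300)

print(pearson(pred, target))                         # one number, however we slice
print(mean_chunk_pearson(pred, target, [100, 200]))  # equal thirds
print(mean_chunk_pearson(pred, target, [30, 60]))    # different cuts, different value
```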
Eli:
Sorry, I think you're right. Calculating the correlation per batch and then taking the average is not a good idea, as it ignores the global distribution of the data.

Reply:
Interesting. I mean, the thing that seemed off was that your proposal averaged over arbitrary cut points. Taking the mean after splitting on a nuisance variable, on the other hand, could make a ton of sense: it could help control for confounding, for example. You could cut the data up by something you don't want included in your correlation measurement, such as a chromosome boundary, a GC-percent bucket, or some other feature you think is a nuisance variable that is not biologically meaningful. Then you could calculate the correlation within each group and average. Smaller groups, though, might be noisier, which would make real signals harder to detect. Just some other thoughts!
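As a sketch of that idea (the names and synthetic data are assumptions for illustration, not anything from the repo): a per-group offset shared by predictions and targets, say a chromosome effect, inflates the global correlation, while grouping on the nuisance variable before correlating removes it:

```python
import numpy as np

def pearson(x, y):
    x, y = x - x.mean(), y - y.mean()
    return (x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum())

def grouped_pearson(pred, target, groups):
    """Correlate within each level of a nuisance variable, then average."""
    return np.mean([pearson(pred[groups == g], target[groups == g])
                    for g in np.unique(groups)])

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(4), 256)  # e.g. 4 chromosomes
offset = 2.0 * groups                  # nuisance effect shared by both tracks
pred = offset + rng.normal(size=groups.size)
target = offset + rng.normal(size=groups.size)

print(pearson(pred, target))                  # high, driven by the shared offset
print(grouped_pearson(pred, target, groups))  # near zero: no within-group signal
```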
Eli:
Thanks! That helps a lot!
Hello, can I ask how you found that the human Pearson R is 0.625 for validation and 0.65 for test? I couldn't find this information in the paper. Is it recorded anywhere else?