Dealing with class imbalance in Deep Learning #193

Open
1 of 2 tasks
souravsingh opened this issue Jul 21, 2019 · 4 comments

Comments

@souravsingh
Contributor

souravsingh commented Jul 21, 2019

Have you checked the list of proposed tips to see if the tip has already been proposed?

  • Yes

Did you add yourself as a contributor by making a pull request if this is your first contribution?

  • Yes, I added myself or am already a contributor

Feel free to elaborate, rant, and/or ramble.
There might be an imbalance in the class distribution, which is quite common in bioinformatics problems. I believe most of the usual techniques for dealing with imbalance in ML should work in deep learning as well:

  1. Rephrase the problem
  2. Obtain more data
  3. Weight the classes to compensate for the imbalance
  4. Apply regularization techniques
  5. Use oversampling or undersampling techniques(?)
  6. Use K-fold CV correctly
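
As a rough illustration of points 3 and 6, here is a minimal sketch in plain Python. It assumes that "tweaking weights" means inverse-frequency class weights (the common "balanced" heuristic, n_samples / (n_classes * class_count)) and that using K-fold CV "the correct way" includes keeping the class ratio the same in every fold. Both function names are hypothetical, and the toy labels are made up for the example.

```python
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each fold keeps roughly the same class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Deal the indices of each class round-robin across the folds.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy data (an assumption for illustration): 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
folds = stratified_folds(labels, n_folds=5)
```

The weights could then scale each sample's contribution to the loss, so the rare class counts as much in total as the common one. In practice the indices within each class would usually be shuffled before being dealt into folds.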

Any citations for the rule? (peer-reviewed literature preferred but not required)

@agitter
Collaborator

agitter commented Jul 22, 2019

I agree that class imbalance is a common issue in biology. How much of the discussion would be specific to deep learning as opposed to general ML? If the solutions are general, we may only mention it briefly instead of making a full tip.

Do the solutions of rephrasing the problem and obtaining more data apply in biology? In settings like genome annotation or chemical bioactivity classification, the domain is inherently dominated by negatives regardless of how much data we acquire.

This topic also fits with the brief sentence we have now about ROC having limited utility for class imbalanced problems.
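
To make the ROC point concrete, the sketch below compares ROC AUC with average precision (a summary of the precision-recall curve) on synthetic scores. The choice of average precision as the comparison metric is an assumption, not something taken from the manuscript, and the data are invented: 1000 negatives with evenly spread scores and 10 positives that rank fairly high but are interleaved with the top negatives.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic, made-up scores: 1000 negatives and 10 positives (1% prevalence).
neg_scores = [i / 1000 for i in range(1000)]
pos_scores = [0.9005 + 0.01 * j for j in range(10)]
labels = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

On this toy data the ROC AUC comes out at about 0.95, while the average precision stays below 0.1, because roughly ten negatives outrank each positive. The same ranking looks strong under one metric and weak under the other.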

@rasbt
Collaborator

rasbt commented Jul 22, 2019

> I agree that class imbalance is a common issue in biology. How much of the discussion would be specific to deep learning as opposed to general ML? If the solutions are general, we may only mention it briefly instead of making a full tip.

Good point. I am not sure how successful this is in general, but I recently came across a paper in which the researchers used GANs to generate synthetic samples to address the imbalance issue. In general, though, I think DL is neither more prone nor more immune to class imbalance than other ML approaches.

One approach that is more DL-specific, though, is the focal loss, which was first proposed for RetinaNet:

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980-2988). (https://arxiv.org/abs/1708.02002)
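
For reference, the binary focal loss from that paper is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. Below is a minimal single-example sketch in plain Python. The function name is hypothetical, the defaults gamma = 2 and alpha = 0.25 are the values reported as working best in the paper, and the clipping constant `eps` is an assumption added for numerical safety. A real model would use a vectorized version in its DL framework.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one example; p is the predicted probability of class 1, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing factor
    p_t = min(max(p_t, eps), 1.0)                # avoid log(0)
    # (1 - p_t) ** gamma down-weights examples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)   # confident, correct prediction
hard = binary_focal_loss(0.1, 1)   # confident, wrong prediction
```

With gamma = 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy. Larger gamma shifts the loss toward hard examples, which is why it can help when easy negatives dominate.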

@souravsingh
Contributor Author

I believe obtaining more data points can help for certain problems, such as those in cancer genomics, where a lab could tap into privately generated data to help solve the problem.

@souravsingh
Contributor Author

In line with @rasbt's comment on GANs, I remember reading a paper that used RNNs to generate protein sequences with a certain type of activity. We could mention this as one way to get more data samples.
