Better Handle Variable Types #16

Open
jrm5100 opened this issue Jul 18, 2019 · 2 comments

jrm5100 commented Jul 18, 2019

I think a good test-case for this will be creating a datatype for a variant based on some PLINK files:

  • BED (2-bit genotype code, 00/01/10/11, for each sample at each variant)
  • BIM (chromosome, variant ID, genetic position in morgans or centimorgans, base-pair coordinate, allele 1, allele 2)
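To make the BED encoding concrete, here is a minimal sketch of unpacking the 2-bit codes for a single variant. It assumes the standard PLINK 1 layout (four samples per byte, lowest-order bits first; 00 = homozygous for allele 1, 01 = missing, 10 = heterozygous, 11 = homozygous for allele 2) and that the 3-byte file header has already been stripped. The function name `decode_bed_bytes` and the use of -1 for missing are illustrative choices, not part of CLARITE.

```python
import numpy as np

# Lookup from 2-bit code to number of allele-1 copies (-1 marks missing).
# Index:            0b00  0b01  0b10  0b11
CODE_TO_A1_COUNT = np.array([2, -1, 1, 0], dtype=np.int8)


def decode_bed_bytes(raw: bytes, n_samples: int) -> np.ndarray:
    """Decode the packed genotype bytes of one variant (hypothetical helper)."""
    packed = np.frombuffer(raw, dtype=np.uint8)
    # Each byte holds 4 samples; the first sample sits in the lowest two bits.
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    # Drop the padding entries at the end of the last byte.
    return CODE_TO_A1_COUNT[codes.ravel()[:n_samples]]
```

For example, the single byte `0b11100100` packs the codes 00, 01, 10, 11 for four consecutive samples.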

I think the BIM information would be part of a pandas.api.extensions.ExtensionDtype type (since the information applies to a whole array/column), and the actual genotype values would be stored in a pandas.api.extensions.ExtensionArray type.
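One possible arrangement, sketched under the assumption that column-level metadata is carried by the dtype (the way pandas' own parametrized dtypes such as CategoricalDtype carry their parameters). The class name `VariantDtype` and its attributes are hypothetical, and the matching ExtensionArray subclass, which would hold the per-sample values, is deliberately left out, so this dtype cannot yet back a Series.

```python
from pandas.api.extensions import ExtensionDtype


class VariantDtype(ExtensionDtype):
    """Hypothetical dtype describing one variant (one BIM row)."""

    type = int  # scalar type of a single element, e.g. an allele count
    # pandas uses the attributes named in _metadata for equality and hashing.
    _metadata = ("chromosome", "variant_id", "coordinate", "allele1", "allele2")

    def __init__(self, chromosome, variant_id, coordinate, allele1, allele2):
        self.chromosome = chromosome
        self.variant_id = variant_id
        self.coordinate = coordinate
        self.allele1 = allele1
        self.allele2 = allele2

    @property
    def name(self) -> str:
        return f"variant[{self.variant_id}]"

    @classmethod
    def construct_array_type(cls):
        # A real implementation would return its ExtensionArray subclass here.
        raise NotImplementedError("array class omitted from this sketch")
```

Two dtype instances built from the same BIM row compare equal, which is what would let pandas treat two columns as having the same variant type.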

jrm5100 commented Aug 5, 2019

Currently, the variable types are inferred from the underlying pandas dtypes:

Binary = Categorical with exactly 2 categories
Categorical = Categorical with > 2 categories
Continuous = Any numeric dtype that hasn't been converted to the 'category' dtype
Unknown = 'object' dtype, which occurs when values of different types are combined or when there are strings
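The mapping above can be sketched as a small function. This is only an illustration of the rules as described in this comment, not CLARITE's actual code; the name `infer_variable_type` is made up, and the handling of a categorical with fewer than two categories is an assumption, since that case is not covered above.

```python
import pandas as pd


def infer_variable_type(series: pd.Series) -> str:
    """Classify a column by its pandas dtype (hypothetical helper)."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        n_categories = len(series.cat.categories)
        if n_categories == 2:
            return "binary"
        if n_categories > 2:
            return "categorical"
        # Assumption: fewer than two categories is not defined above.
        return "unknown"
    if pd.api.types.is_numeric_dtype(series.dtype):
        return "continuous"
    # Everything else, including 'object' columns of strings or mixed values.
    return "unknown"
```

Because the result depends only on the dtype, a numeric column with two distinct values stays "continuous" until it is explicitly converted to the 'category' dtype.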

  • Right now converting to categorical actually converts to binary if there are only two unique values.
  • Ideally the starting type would always be "unknown".

The custom datatypes mentioned above could be a good solution (and would allow for more types in the future as well).

jrm5100 changed the title from "Handle Genotype Data" to "Better Handle Variable Types" on Aug 5, 2019
jrm5100 self-assigned this on Jun 4, 2020
jrm5100 added the enhancement (New feature or request) label on Jun 4, 2020

jrm5100 commented Aug 3, 2020

As of pandas v1.1, values of any type can be converted to the "string" dtype (StringDtype). CLARITE could treat this dtype as "Unknown" and load data as this type by default.
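A minimal sketch of that conversion, assuming pandas 1.1 or later. The helper name `load_as_unknown` is hypothetical and only illustrates the idea of starting every column as text.

```python
import pandas as pd


def load_as_unknown(values) -> pd.Series:
    """Store raw values with the "string" dtype (hypothetical helper)."""
    # astype("string") converts each non-missing value to its text form.
    return pd.Series(values, dtype="object").astype("string")


mixed = load_as_unknown([1, "two", 3.0])
```

Here `mixed` holds the text values "1", "two" and "3.0", so a later step could decide explicitly whether the column should become binary, categorical, or continuous.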
