Context
Comparing performance among tools is very hard, and becomes even harder when little is known about the training data used for each model. A consensus way to split data in a reproducible manner could therefore be critical down the road, giving consistent assurance that a specific peptide has never been seen by a given model.
Proposal
Generate a hashing function that assigns a number to every peptide sequence and, on average, produces an approximately uniform distribution, so it can be used for percentage-based train/test splits.
In other words, propose a way to convert the train = [random() > 0.8 for x in PEPTIDES] pattern to train = [hash(x) > 0.8 for x in PEPTIDES].
Make a reference implementation and testing examples that can be re-implemented in any programming language/framework.
Guidelines for specific sequences that should not be trained on (iRT peptides/ProCal, which should be used as landmarks and not as training sequences??)
I would recommend NEVER training on anything with a hash in the range [0.9, 1], and would generally discourage > 0.8 as well, along with Biognosys iRT peptides/ProCal peptides ...
Some internal testing to verify that compositions/motifs are not being systematically over-represented in any one split.
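As one concrete way to read the pattern above (a sketch for discussion, not a proposed standard): any process-stable hash, e.g. SHA-256, can replace Python's built-in hash() so that the same peptide maps to the same number in every run and every language. The function and variable names here are made up for illustration.

```python
import hashlib


def stable_hash01(pep: str) -> float:
    """Map a peptide sequence to a deterministic float in [0, 1).

    SHA-256 yields the same digest in every run, Python version, and
    programming language, unlike Python's built-in hash(), which is
    salted per process.
    """
    digest = hashlib.sha256(pep.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, scale to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64


# The train = [random() > 0.8 for x in PEPTIDES] pattern then becomes:
PEPTIDES = ["AAA", "AAAK", "LESLIEK", "ELVISLIVESK"]
train = [stable_hash01(p) <= 0.8 for p in PEPTIDES]
```

Because the digest bytes are effectively uniform, the fraction of peptides with `stable_hash01(p) <= 0.8` approaches 80% as the peptide set grows, while the assignment of any individual peptide never changes.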
First implementation
This is the hashing that I use in my model to accomplish this task (with some minor modifications for readability):
from typing import Literal

SplitSet = Literal["Train", "Test", "Val"]

# Generated using {x: hash(x) for x in CONFIG.encoding_aa_order} once
# and then hard-coded
HASHDICT = {
    "A": 8990350376580739186,
    "C": -5648131828304525110,
    "D": 6043088297348140225,
    "E": 2424930106316864185,
    "F": 7046537624574876942,
    "G": 3340710540999258202,
    "H": 6743161139278114243,
    "I": -3034276714411840744,
    "K": -6360745720327592128,
    "L": -5980349674681488316,
    "M": -5782039407703521972,
    "N": -5469935875943994788,
    "P": -9131389159066742055,
    "Q": -3988780601193558504,
    "R": -961126793936120965,
    "S": 8601576106333056321,
    "T": -826347925826021181,
    "V": 6418718798924587169,
    "W": -3331112299842267173,
    "X": -7457703884378074688,
    "Y": 2606728663468607544,
}


def select_split(pep: str) -> SplitSet:
    """Assigns a peptide to a split set based on its sequence.

    It assigns all iRT peptides to the 'Val' set. The rest of the
    peptides are hashed based on their stripped sequence (no mods).
    It is done on a semi-random basis.

    Args:
        pep (str): Peptide to assign to a split set.

    Returns:
        SplitSet: Split set to assign the peptide to.
            This is either one of "Train", "Test" or "Val".

    Examples:
        >>> select_split("AAA")
        'Train'
        >>> select_split("AAAK")
        'Test'
        >>> select_split("AAAKK")
        'Train'
        >>> select_split("AAAMTKK")
        'Train'
    """
    num_hash = sum(HASHDICT[x] for x in pep)
    num_hash = num_hash / 1e4
    num_hash = num_hash % 1
    assert 0 <= num_hash <= 1
    return _select_split(pep, num_hash)


# IRT_PEPTIDES is a hard-coded set of peptides that I use as landmarks
# to align my retention times but actively exclude from training.
def _select_split(pep: str, num_hash: float) -> SplitSet:
    in_landmark = pep in IRT_PEPTIDES
    if num_hash > 0.8 or in_landmark:
        return "Val"
    elif num_hash > 0.6:
        return "Test"
    else:
        return "Train"
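One property of the additive scheme above that the "internal testing" bullet could probe: because per-residue hashes are summed, the value is invariant to residue order, so every permutation of a given composition lands in the same split. A self-contained toy check (only two residues reimplemented here, not the full code above):

```python
# Toy reimplementation of the additive hashing above, restricted to two
# residues so the snippet is self-contained.
HASHDICT = {"A": 8990350376580739186, "K": -6360745720327592128}


def hash01(pep: str) -> float:
    # Sum per-residue hashes, then take a fractional part in [0, 1).
    return (sum(HASHDICT[x] for x in pep) / 1e4) % 1


# Summation is commutative, so anagrams always receive the same value
# and therefore the same split assignment:
assert hash01("AAK") == hash01("AKA") == hash01("KAA")
```

Whether grouping all anagrams into one split is a feature (no composition straddles the train/test boundary) or a bias is exactly the kind of question the proposed internal testing should answer.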
I think it would be great if we could have a discussion on a good way to do this and IDEALLY have an implementation of something like this in our MS-related training frameworks.
(I will progressively add people to the conversation but feel free to add anyone in the community whose input should be included here).