Context
Comparing performance among tools is very hard, and becomes even harder when little is known about the training data used for each model. A consensus way to split data in a reproducible manner could therefore be critical down the road, giving consistent assurance that a specific peptide has never been seen by a given model.
Proposal
Generate a hashing function that assigns a number to every peptide sequence and, on average, produces an approximately uniform distribution, so it can be used for percentage-based train/test splits.
In other words, propose a way to convert the train = [random() > 0.8 for x in PEPTIDES] pattern to train = [hash(x) > 0.8 for x in PEPTIDES].
Make a reference implementation and testing examples that can be re-implemented in any programming language/framework.
Guidelines for specific sequences that should not be trained on (iRT peptides/ProCal, which should be used as landmarks and not as training sequences??)
I would recommend NEVER training on anything with a hash in the range [0.9, 1], and would generally discourage > 0.8 as well, along with Biognosys iRT peptides/ProCal peptides ...
Some internal testing to verify that compositions/motifs are not being systematically over-represented in any one split.
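As one concrete way to read the pattern above (a sketch for discussion, not a proposed standard): any process-stable hash, e.g. SHA-256, can replace Python's built-in hash() so that the same peptide maps to the same number in every run and every language. The function and variable names here are made up for illustration.

```python
import hashlib


def stable_hash01(pep: str) -> float:
    """Map a peptide sequence to a deterministic float in [0, 1).

    SHA-256 yields the same digest in every run, Python version, and
    programming language, unlike Python's built-in hash(), which is
    salted per process.
    """
    digest = hashlib.sha256(pep.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, scale to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64


# The train = [random() > 0.8 for x in PEPTIDES] pattern then becomes:
PEPTIDES = ["AAA", "AAAK", "LESLIEK", "ELVISLIVESK"]
train = [stable_hash01(p) <= 0.8 for p in PEPTIDES]
```

Because the digest bytes are effectively uniform, the fraction of peptides with `stable_hash01(p) <= 0.8` approaches 80% as the peptide set grows, while the assignment of any individual peptide never changes.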
First implementation
This is the hashing that I use in my model to accomplish this task (with some minor modifications for readability):
from typing import Literal

SplitSet = Literal["Train", "Test", "Val"]

# Generated using {x: hash(x) for x in CONFIG.encoding_aa_order} once
# and then hard-coded
HASHDICT = {
    "A": 8990350376580739186,
    "C": -5648131828304525110,
    "D": 6043088297348140225,
    "E": 2424930106316864185,
    "F": 7046537624574876942,
    "G": 3340710540999258202,
    "H": 6743161139278114243,
    "I": -3034276714411840744,
    "K": -6360745720327592128,
    "L": -5980349674681488316,
    "M": -5782039407703521972,
    "N": -5469935875943994788,
    "P": -9131389159066742055,
    "Q": -3988780601193558504,
    "R": -961126793936120965,
    "S": 8601576106333056321,
    "T": -826347925826021181,
    "V": 6418718798924587169,
    "W": -3331112299842267173,
    "X": -7457703884378074688,
    "Y": 2606728663468607544,
}


def select_split(pep: str) -> SplitSet:
    """Assigns a peptide to a split set based on its sequence.

    It assigns all iRT peptides to the 'Val' set. The rest of the
    peptides are hashed based on their stripped sequence (no mods).
    It is done on a semi-random basis.

    Args:
        pep (str): Peptide to assign to a split set.

    Returns:
        SplitSet: Split set to assign the peptide to.
            This is either one of "Train", "Test" or "Val".

    Examples:
        >>> select_split("AAA")
        'Train'
        >>> select_split("AAAK")
        'Test'
        >>> select_split("AAAKK")
        'Train'
        >>> select_split("AAAMTKK")
        'Train'
    """
    num_hash = sum(HASHDICT[x] for x in pep)
    num_hash = num_hash / 1e4
    num_hash = num_hash % 1
    assert 0 <= num_hash <= 1
    return _select_split(pep, num_hash)


# IRT_PEPTIDES is a hard-coded set of peptides that I use as landmarks
# to align my retention times but actively exclude from training.
def _select_split(pep: str, num_hash: float) -> SplitSet:
    in_landmark = pep in IRT_PEPTIDES
    if num_hash > 0.8 or in_landmark:
        return "Val"
    elif num_hash > 0.6:
        return "Test"
    else:
        return "Train"
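One property of the additive scheme above that the "internal testing" bullet could probe: because per-residue hashes are summed, the value is invariant to residue order, so every permutation of a given composition lands in the same split. A self-contained toy check (only two residues reimplemented here, not the full code above):

```python
# Toy reimplementation of the additive hashing above, restricted to two
# residues so the snippet is self-contained.
HASHDICT = {"A": 8990350376580739186, "K": -6360745720327592128}


def hash01(pep: str) -> float:
    # Sum per-residue hashes, then take a fractional part in [0, 1).
    return (sum(HASHDICT[x] for x in pep) / 1e4) % 1


# Summation is commutative, so anagrams always receive the same value
# and therefore the same split assignment:
assert hash01("AAK") == hash01("AKA") == hash01("KAA")
```

Whether grouping all anagrams into one split is a feature (no composition straddles the train/test boundary) or a bias is exactly the kind of question the proposed internal testing should answer.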
I think it would be great if we could have a discussion on a good way to do this and IDEALLY have an implementation of something like this in our MS-related training frameworks.
(I will progressively add people to the conversation but feel free to add anyone in the community whose input should be included here).