Hello,
I want to use a pre-trained BERT model to get embeddings, and then use those embeddings with an SVM to do binary classification.
How can I get the embeddings? Is my code below correct for this? Which output is the embedding: sequence_output, pooled_output, or embedding?
import torch
from tape import ProteinBertModel, TAPETokenizer
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac') # iupac is the vocab for TAPE models, use unirep for the UniRep model
# Pfam Family: Hexapep, Clan: CL0536
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
output = model(token_ids)
sequence_output = output[0]
pooled_output = output[1]
# NOTE: pooled_output is *not* trained for the transformer, do not use
# w/o fine-tuning. A better option for now is to simply take a mean of
# the sequence output
# Mean over the token dimension; len(sequence) + 2 counts the start/stop
# tokens the tokenizer adds around the sequence
embedding = sum(sequence_output[0]) / (len(sequence) + 2)
print(sequence_output.size())  # Result of run: torch.Size([1, 38, 768])
print(pooled_output.size())    # Result of run: torch.Size([1, 768])
print(embedding.size())        # Result of run: torch.Size([768])
How can I use this code for 100 protein sequences? Should I use a for loop?
Thank you in advance!