Padding text inputs to TextTransformer results in incorrect captions #588

Open

dsikka opened this issue Jul 29, 2023 · 4 comments

@dsikka

dsikka commented Jul 29, 2023

Hello,

I am trying to run the caption generation workflow and was wondering what I need to do if the inputs to the TextTransformer model are always padded to a fixed length. Padding the input with the pad_token_id results in nonsensical captions.

How should the attn_mask be updated in both the TextTransformer and the MultiModalDecoder? Currently, the input to the TextTransformer grows as the caption is generated, but I'd like to pad it to a fixed length.
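To illustrate the setup (just a toy sketch; the token values, the fixed length, and the pad id here are only examples):

import torch
import torch.nn.functional as F

fixed_len = 15                           # example fixed length
pad_token_id = 0                         # example pad id
text = torch.tensor([[101, 202, 303]])   # (batch, cur_len) tokens generated so far
# instead of a tensor that grows each step, every call would receive this:
padded = F.pad(text, (fixed_len - text.shape[-1], 0), value=pad_token_id)  # (batch, fixed_len)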

Thanks.

@lucidrains @gpucce @iejMac

@gpucce
Contributor

gpucce commented Jul 31, 2023

Hi @dsikka, can I ask which generation type you are using, and how you are padding?

There should be a fixed-length argument that the generator can use; these are its arguments:

image,
text=None,
seq_len=30,
max_seq_len=77,
temperature=1.,
generation_type="beam_search",
top_p=0.1, # keep tokens in the 1 - top_p quantile
top_k=1, # keeps the top_k most probable tokens
pad_token_id=None,
eos_token_id=None,
sot_token_id=None,
num_beams=6,
num_beam_groups=3,
min_seq_len=5,
stopping_criteria=None,
repetition_penalty=1.0,
fixed_output_length=False # if True output.shape == (batch_size, seq_len)

Using fixed_output_length=True should give you outputs of the same length. However, if this is not what you are looking for, can you explain a bit more how you would like things to work, maybe with an example, and I can help you further.
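For instance (a minimal sketch, assuming model is a CoCa model loaded through open_clip and image is a preprocessed image batch):

generated = model.generate(
    image,
    generation_type="beam_search",
    seq_len=30,
    fixed_output_length=True,  # output.shape == (batch_size, seq_len)
)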

@dsikka
Author

dsikka commented Jul 31, 2023

Hi, thanks for the quick reply @gpucce.

I am currently using beam_search and was referring to the input to the text model in the forward pass:

text_latent, token_embs = self._encode_text(text, embed_cls=embed_cls)

The text input has a variable length as the caption is generated. I want to pad this input so that all calls to self.text() on this line receive an input of the same size:

text_latent, token_emb = self.text(text)

Possibly something like this, if we were to pad all inputs to length 15?

import torch.nn.functional as F

og_len = text.shape[-1]
# left-pad to a fixed length of 15 (F.pad fills with 0 by default, which may or may not match pad_token_id)
r = F.pad(text, (15 - og_len, 0))
text_latent, token_emb = self.text(r)

I was wondering how to do this correctly while also updating the attn_mask accordingly:

attn_mask = self.attn_mask
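For context, this is roughly the kind of update I had in mind (just a sketch with a hypothetical helper, not existing open_clip code), assuming the input is left-padded with pad_token_id and attn_mask is the usual additive causal mask of shape (seq_len, seq_len):

import torch

def pad_aware_attn_mask(causal_mask, text, pad_token_id):
    # causal_mask: (seq_len, seq_len) additive mask, -inf above the diagonal
    # text: (batch_size, seq_len) token ids, left-padded with pad_token_id
    batch_size, seq_len = text.shape
    mask = causal_mask.unsqueeze(0).expand(batch_size, -1, -1).clone()
    # block attention to the padded key positions
    mask.masked_fill_(text.eq(pad_token_id).unsqueeze(1), float("-inf"))
    # keep self-attention so fully padded query rows don't become all -inf (NaNs after softmax)
    idx = torch.arange(seq_len)
    mask[:, idx, idx] = 0.0
    return mask

I am not sure whether a per-batch mask like this is compatible with how the mask is passed into the attention layers (e.g. whether it would need to be repeated per head), which is part of what I am asking.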

@dsikka
Author

dsikka commented Aug 7, 2023

Hi, just wanted to follow up on this.

@lucidrains @gpucce @iejMac

@dsikka
Author

dsikka commented Aug 14, 2023

@lucidrains @gpucce @iejMac
