Padding text inputs to TextTransformer results in incorrect captions #588

Open

dsikka opened this issue Jul 29, 2023 · 4 comments

@dsikka

dsikka commented Jul 29, 2023

Hello,

I am trying to run the caption generation workflow and was wondering what I need to do if the inputs to the TextTransformer model are always padded to a fixed length. Padding the input with the pad_token_id results in nonsensical captions.

How should the attn_mask be updated in both the TextTransformer and the MultiModalDecoder? Currently, the input to the TextTransformer grows as the caption is generated, but I'd like to pad it to a fixed length.
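To illustrate the setup (just a toy sketch; the token values, the fixed length, and the pad id here are only examples):

import torch
import torch.nn.functional as F

fixed_len = 15                           # example fixed length
pad_token_id = 0                         # example pad id
text = torch.tensor([[101, 202, 303]])   # (batch, cur_len) tokens generated so far
# instead of a tensor that grows each step, every call would receive this:
padded = F.pad(text, (fixed_len - text.shape[-1], 0), value=pad_token_id)  # (batch, fixed_len)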

Thanks.

@lucidrains @gpucce @iejMac

@gpucce
Contributor

gpucce commented Jul 31, 2023

Hi @dsikka, can I ask which generation type you are using, and how you are padding?

There should be a fixed-length argument that the generator can use; these are its arguments:

image,
text=None,
seq_len=30,
max_seq_len=77,
temperature=1.,
generation_type="beam_search",
top_p=0.1, # keep tokens in the 1 - top_p quantile
top_k=1, # keeps the top_k most probable tokens
pad_token_id=None,
eos_token_id=None,
sot_token_id=None,
num_beams=6,
num_beam_groups=3,
min_seq_len=5,
stopping_criteria=None,
repetition_penalty=1.0,
fixed_output_length=False # if True output.shape == (batch_size, seq_len)

Using fixed_output_length=True should give you outputs of the same length. However, if this is not what you are looking for, can you explain a bit more how you would like things to work, maybe with an example, and I can help you further.
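For instance (a minimal sketch, assuming model is a CoCa model loaded through open_clip and image is a preprocessed image batch):

generated = model.generate(
    image,
    generation_type="beam_search",
    seq_len=30,
    fixed_output_length=True,  # output.shape == (batch_size, seq_len)
)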

@dsikka
Author

dsikka commented Jul 31, 2023

Hi, thanks for the quick reply @gpucce.

I am currently using beam_search and was referring to the input to the text model in the forward pass:

text_latent, token_embs = self._encode_text(text, embed_cls=embed_cls)

The text input has a variable length as the caption is generated. I want to pad this input so that all calls to self.text() on this line receive an input of the same size:

text_latent, token_emb = self.text(text)

Possibly something like this, if we were to pad all inputs to length 15?

import torch.nn.functional as F

og_len = text.shape[-1]
# left-pad to a fixed length of 15 (F.pad fills with 0 by default, which may or may not match pad_token_id)
r = F.pad(text, (15 - og_len, 0))
text_latent, token_emb = self.text(r)

I was wondering how to do this correctly while also updating the attn_mask accordingly:

attn_mask = self.attn_mask
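For context, this is roughly the kind of update I had in mind (just a sketch with a hypothetical helper, not existing open_clip code), assuming the input is left-padded with pad_token_id and attn_mask is the usual additive causal mask of shape (seq_len, seq_len):

import torch

def pad_aware_attn_mask(causal_mask, text, pad_token_id):
    # causal_mask: (seq_len, seq_len) additive mask, -inf above the diagonal
    # text: (batch_size, seq_len) token ids, left-padded with pad_token_id
    batch_size, seq_len = text.shape
    mask = causal_mask.unsqueeze(0).expand(batch_size, -1, -1).clone()
    # block attention to the padded key positions
    mask.masked_fill_(text.eq(pad_token_id).unsqueeze(1), float("-inf"))
    # keep self-attention so fully padded query rows don't become all -inf (NaNs after softmax)
    idx = torch.arange(seq_len)
    mask[:, idx, idx] = 0.0
    return mask

I am not sure whether a per-batch mask like this is compatible with how the mask is passed into the attention layers (e.g. whether it would need to be repeated per head), which is part of what I am asking.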

@dsikka
Author

dsikka commented Aug 7, 2023

Hi, just wanted to follow up on this.

@lucidrains @gpucce @iejMac

@dsikka
Author

dsikka commented Aug 14, 2023

@lucidrains @gpucce @iejMac
