
SRL: RuntimeError (index out of range) when predicting long sentence with uncommon tokens #3235

Closed
nsaef opened this issue Sep 10, 2019 · 2 comments

Comments

@nsaef

nsaef commented Sep 10, 2019

Problem description and stack trace
I'm running Semantic Role Labelling, getting the model from this URL: "https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz".
When predicting a specific sentence, a RuntimeError is thrown: "index out of range at ..\aten\src\TH/generic/THTensorEvenMoreMath.cpp:193".
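
For reference, a minimal reproduction sketch; `Predictor.from_path` and `predict_batch_json` are the standard AllenNLP 0.8 entry points, and the sentence is abbreviated here (the full input is quoted below):

```python
# Minimal reproduction sketch, assuming the AllenNLP 0.8 Predictor API.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz"
)

# Raises the RuntimeError below for the (abbreviated) sentence quoted further down.
predictor.predict_batch_json(
    [{"sentence": "Could you please isolate the samples of ..."}]
)
```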

Here's the stack trace:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "\documents\allennlp\allennlp\predictors\semantic_role_labeler.py", line 180, in predict_batch_json
    outputs.extend(self._model.forward_on_instances(batch))
  File "\documents\allennlp\allennlp\models\model.py", line 153, in forward_on_instances
    outputs = self.decode(self(**model_input))
  File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "\documents\allennlp\allennlp\models\srl_bert.py", line 102, in forward
    output_all_encoded_layers=False)
  File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "\Anaconda\envs\st_win\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 730, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "\Anaconda\envs\st_win\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 268, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\sparse.py", line 117, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at ..\aten\src\TH/generic/THTensorEvenMoreMath.cpp:193

The following sentence causes the error:

[{'sentence': 'Could you please isolate the samples of Methionylalanylthreonylserylarginylglycylalanylserylarginylcysteinylproly- \n larginylaspartylisoleucylalanylasparaginylvalylmethionylglutaminylarginyl- \n leucylglutaminylaspartylglutamylglutaminylglutamylisoleucylvalylglutaminy- \n llysylarginylthreonylphenylalanylthreonyllysyltryptophylisoleucylasparagi- \n nylserylhistidylleucylalanyllysylarginyllysylprolylprolylmethionylvalylva- \n lylaspartylaspartylleucylphenylalanylglutamylaspartylmethionyllysylaspart- \n ylglycylvalyllysylleucylleucylalanylleucylleucylglutamylvalylleucylserylg- \n lycylglutaminyllysylleucylprolylcysteinylglutamylglutaminylglycylarginyla- \n rginylmethionyllysylarginylisoleucylhistidylalanylvalylalanylasparaginyli- \n soleucylglycylthreonylalanylleucyllysylphenylalanylleucylglutamylglycylar- \n ginyllysylisoleucyllysylleucylvalylasparaginylisoleucylasparaginylserylth- \n reonylaspartylisoleucylalanylaspartylglycylarginylprolylserylisoleucylval- \n ylleucylglycylleucylmethionyltryptophylthreonylisoleucylisoleucylleucylty- \n rosylphenylalanylglutaminylisoleucylglutamylglutamylleucylthreonylserylas- \n paraginylleucylprolylglutaminylleucylglutaminylserylleucylserylserylseryl- \n alanylserylserylvalylaspartylserylisoleucylvalylserylserylglutamylthreony- \n lprolylserylprolylprolylseryllysylarginyllysylvalylthreonylthreonyllysyli- \n soleucylglutaminylglycylasparaginylalanyllysyllysylalanylleucylleucyllysy- \n ltryptophylvalylglutaminyltyrosylthreonylalanylglycyllysylglutaminylthreo- \n nylglycylisoleucylglutamylvalyllysylaspartylphenylalanylglycyllysylserylt- \n ryptophylarginylserylglycylvalylalanylphenylalanylhistidylserylvalylisole- \n ucylhistidylalanylisoleucylarginylprolylglutamylleucylvalylaspartylleucyl- \n glutamylthreonylvalyllysylglycylarginylserylasparaginylarginylglutamylasp- \n araginylleucylglutamylaspartylalanylphenylalanylthreonylisoleucylalanylgl- \n utamylthreonylglutamylleucylglycylisoleucylprolylarginylleucylleucylaspar- \n tylprolylglutamylaspartylvalylaspartylvalylaspartyllysylprolylaspartylglu- \n tamyllysylserylisoleucylmethionylthreonyltyrosylvalylalanylglutaminylphen- \n ylalanylleucyllysylhistidyltyrosylprolylaspartylisoleucylhistidylasparagi- \n nylalanylserylthreonylaspartylglycylglutaminylglutamylaspartylaspartylglu- \n tamylisoleucylleucylprolylglycylphenylalanylprolylserylphenylalanylalanyl- \n asparaginylserylvalylglutaminylasparaginylphenylalanyllysylarginylglutamy- \n laspartylarginylvalylisoleucylphenylalanyllysylglutamylmethionyllysylvaly- \n ltryptophylisoleucylglutamylglutaminylphenylalanylglutamylarginylaspartyl- \n leucylthreonylarginylalanylglutaminylmethionylvalylglutamylserylasparagin- \n ylleucylglutaminylaspartyllysyltyrosylglutaminylserylphenylalanyllysylhis- \n tidylphenylalanylarginylvalylglutaminyltyrosylglutamylmethionyllysylargin- \n yllysylglutaminylisoleucylglutamylhistidylleucylisoleucylglutaminylprolyl- \n leucylhistidylarginylaspartylglycyllysylleucylserylleucylaspartylglutamin- \n ylalanylleucylvalyllysylglutaminylseryltryptophylaspartylarginylvalylthre- \n onylserylarginylleucylphenylalanylaspartyltryptophylhistidylisoleucylglut- \n aminylleucylaspartyllysylserylleucylprolylalanylprolylleucylglycylthreony- \n lisoleucylglycylalanyltryptophylleucyltyrosylarginylalanylglutamylvalylal- \n anylleucylarginylglutamylglutamylisoleucylthreonylvalylglutaminylglutamin- \n ylvalylhistidylglutamylglutamylthreonylalanylasparaginylthreonylisoleucyl- \n 
glutaminylarginyllysylleucylglutamylglutaminylhistidyllysylaspartylleucyl- \n leucylglutaminylasparaginylthreonylaspartylalanylhistidyllysylarginylalan- \n ylphenylalanylhistidylglutamylisoleucyltyrosylarginylthreonylarginylseryl- \n valylasparaginylglycylisoleucylprolylvalylprolylprolylaspartylglutaminyll- \n eucylglutamylaspartylmethionylalanylglutamylarginylphenylalanylhistidylph-'}]

The sentence looks like this after tokenization by my spaCy tokenizer:

[Could, you, please, isolate, the, samples, of, Methionylalanylthreonylserylarginylglycylalanylserylarginylcysteinylproly-, larginylaspartylisoleucylalanylasparaginylvalylmethionylglutaminylarginyl-, leucylglutaminylaspartylglutamylglutaminylglutamylisoleucylvalylglutaminy-, llysylarginylthreonylphenylalanylthreonyllysyltryptophylisoleucylasparagi-, nylserylhistidylleucylalanyllysylarginyllysylprolylprolylmethionylvalylva-, lylaspartylaspartylleucylphenylalanylglutamylaspartylmethionyllysylaspart-, ylglycylvalyllysylleucylleucylalanylleucylleucylglutamylvalylleucylserylg-, lycylglutaminyllysylleucylprolylcysteinylglutamylglutaminylglycylarginyla-, rginylmethionyllysylarginylisoleucylhistidylalanylvalylalanylasparaginyli-, soleucylglycylthreonylalanylleucyllysylphenylalanylleucylglutamylglycylar-, ginyllysylisoleucyllysylleucylvalylasparaginylisoleucylasparaginylserylth-, reonylaspartylisoleucylalanylaspartylglycylarginylprolylserylisoleucylval-, ylleucylglycylleucylmethionyltryptophylthreonylisoleucylisoleucylleucylty-, rosylphenylalanylglutaminylisoleucylglutamylglutamylleucylthreonylserylas-, paraginylleucylprolylglutaminylleucylglutaminylserylleucylserylserylseryl-, alanylserylserylvalylaspartylserylisoleucylvalylserylserylglutamylthreony-, lprolylserylprolylprolylseryllysylarginyllysylvalylthreonylthreonyllysyli-, soleucylglutaminylglycylasparaginylalanyllysyllysylalanylleucylleucyllysy-, ltryptophylvalylglutaminyltyrosylthreonylalanylglycyllysylglutaminylthreo-, nylglycylisoleucylglutamylvalyllysylaspartylphenylalanylglycyllysylserylt-, ryptophylarginylserylglycylvalylalanylphenylalanylhistidylserylvalylisole-, ucylhistidylalanylisoleucylarginylprolylglutamylleucylvalylaspartylleucyl-, glutamylthreonylvalyllysylglycylarginylserylasparaginylarginylglutamylasp-, araginylleucylglutamylaspartylalanylphenylalanylthreonylisoleucylalanylgl-, utamylthreonylglutamylleucylglycylisoleucylprolylarginylleucylleucylaspar-, tylprolylglutamylaspartylvalylaspartylvalylaspartyllysylprolylaspartylglu-, tamyllysylserylisoleucylmethionylthreonyltyrosylvalylalanylglutaminylphen-, ylalanylleucyllysylhistidyltyrosylprolylaspartylisoleucylhistidylasparagi-, nylalanylserylthreonylaspartylglycylglutaminylglutamylaspartylaspartylglu-, tamylisoleucylleucylprolylglycylphenylalanylprolylserylphenylalanylalanyl-, asparaginylserylvalylglutaminylasparaginylphenylalanyllysylarginylglutamy-, laspartylarginylvalylisoleucylphenylalanyllysylglutamylmethionyllysylvaly-, ltryptophylisoleucylglutamylglutaminylphenylalanylglutamylarginylaspartyl-, leucylthreonylarginylalanylglutaminylmethionylvalylglutamylserylasparagin-, ylleucylglutaminylaspartyllysyltyrosylglutaminylserylphenylalanyllysylhis-, tidylphenylalanylarginylvalylglutaminyltyrosylglutamylmethionyllysylargin-, yllysylglutaminylisoleucylglutamylhistidylleucylisoleucylglutaminylprolyl-, leucylhistidylarginylaspartylglycyllysylleucylserylleucylaspartylglutamin-, ylalanylleucylvalyllysylglutaminylseryltryptophylaspartylarginylvalylthre-, onylserylarginylleucylphenylalanylaspartyltryptophylhistidylisoleucylglut-, aminylleucylaspartyllysylserylleucylprolylalanylprolylleucylglycylthreony-, lisoleucylglycylalanyltryptophylleucyltyrosylarginylalanylglutamylvalylal-, anylleucylarginylglutamylglutamylisoleucylthreonylvalylglutaminylglutamin-, ylvalylhistidylglutamylglutamylthreonylalanylasparaginylthreonylisoleucyl-, glutaminylarginyllysylleucylglutamylglutaminylhistidyllysylaspartylleucyl-, leucylglutaminylasparaginylthreonylaspartylalanylhistidyllysylarginylalan-, 
ylphenylalanylhistidylglutamylisoleucyltyrosylarginylthreonylarginylseryl-, valylasparaginylglycylisoleucylprolylvalylprolylprolylaspartylglutaminyll-, eucylglutamylaspartylmethionylalanylglutamylarginylphenylalanylhistidylph-]

Tests and own research
When I cut the sentence off after 16 of the tokens containing (partial) chemical names, the predictor returns a result. It doesn't matter which 16 tokens are in the sentence; I can switch them around. When I don't cut the sentence off, the predictor throws the exception cited above. I've read bug reports in other repos where this error was linked to out-of-vocabulary tokens, which could be the case here (e.g. chenxijun1029/DeepFM_with_PyTorch#1).

The sentence contains 56 tokens, which is well below the limit of around 500 tokens I've encountered with other inputs. It is quite long (3810 characters), but I've tested other sentences of similar length made up of ordinary tokens, and those worked.

Question
I can modify the data after each crash like this and re-run the program, but I'd rather identify sentences likely to fail programmatically and alter them beforehand. Do you know what causes the problem and how it can be fixed or prevented?

System:

  • Ubuntu 18 LTS / Windows 10
  • Python 3.7
  • AllenNLP version: 0.8.5
  • PyTorch version: 1.1.0
@DeNeutoy
Contributor

Hi! The SRL model has a limit of 512 wordpiece tokens, which means that typically you'll want your sentences to be shorter than 350 tokens. However, this can break down in your case because your tokens are disproportionately long.
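
(Where the hard 512 comes from: BERT's learned position-embedding table has exactly 512 rows, so any wordpiece position at index 512 or beyond triggers precisely the out-of-range lookup in the stack trace above. A minimal sketch in plain PyTorch:)

```python
# BERT-base uses max_position_embeddings=512 and hidden size 768, so the
# position-embedding table has 512 rows; position id 512 indexes out of range.
import torch

position_embeddings = torch.nn.Embedding(512, 768)
position_embeddings(torch.tensor([511]))  # fine: last valid position
position_embeddings(torch.tensor([512]))  # RuntimeError: index out of range
```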

This is the tokeniser that is used to go from tokens -> wordpieces:

https://huggingface.co/pytorch-transformers/model_doc/bert.html#pytorch_transformers.BertTokenizer

You could check the wordpiece length of your input sentences beforehand and find a cutoff that works for you.
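
A minimal sketch of that pre-check (assuming the `bert-base-uncased` vocabulary, which the BERT-base SRL model builds on; the vocabulary bundled with the model archive may differ slightly):

```python
# Sketch of the suggested pre-check, using the BertTokenizer linked above.
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def wordpiece_count(tokens):
    # Number of wordpieces BERT will actually see for pre-tokenized input.
    return sum(len(tokenizer.tokenize(t)) for t in tokens)

MAX_WORDPIECES = 512  # hard limit of BERT's position embeddings

tokens = ["Could", "you", "please", "isolate", "the", "samples", "of",
          "Methionylalanylthreonylserylarginylglycylalanylserylarginylcysteinylproly-"]
if wordpiece_count(tokens) >= MAX_WORDPIECES:
    print("too long for the SRL model -- shorten before predicting")
```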

@nsaef
Author

nsaef commented Sep 11, 2019

Hi, and thanks for the reply! I already cut off all sentences at 400 tokens; I may lower that to 350 as you suggested. Now that I know what to look for, I suppose I found the explanation in this Stack Overflow post: https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble

It says this about out-of-vocabulary words:

Quite simply, OOV words are impossible if you use such a segmentation method. Any word which does not occur in the vocabulary will be broken down into subword units. Similarly, for rare words, given that the number of subword merges we used is limited, the word will not occur in the vocabulary, so it will be split into more frequent subwords.

Since all 49 of the chemical-name tokens will be out of vocabulary, the tokenizer likely splits them into lots of substrings of 2-5 characters that align with known words (such as "at", "is", "his", "alan", ...). When I split the text at each of the obvious words, I easily get more than 500 tokens.
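
That blow-up is easy to observe directly (same `bert-base-uncased` assumption as above; the exact pieces depend on the vocabulary):

```python
# Observing the subword blow-up on one of the chemical-name tokens.
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

token = ("Methionylalanylthreonylserylarginylglycylalanylseryl"
         "arginylcysteinylproly-")
pieces = tokenizer.tokenize(token)
print(len(pieces))  # one "word" turns into dozens of wordpieces
print(pieces[:8])   # short, frequent subwords; exact split varies by vocab
```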

I might implement a comparison against a dictionary to shorten long unknown words, but in any case this likely won't be a frequent problem. Thanks for pointing me in the right direction!
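
One possible shape for that pre-filter, as a hypothetical helper (`truncate_to_wordpiece_budget` is not part of AllenNLP): drop trailing tokens until the sentence fits a wordpiece budget.

```python
def truncate_to_wordpiece_budget(tokens, tokenizer, budget=350):
    # Keep tokens from the front until the cumulative wordpiece count
    # would exceed the budget, then stop. 350 leaves headroom below 512.
    kept, used = [], 0
    for token in tokens:
        n = len(tokenizer.tokenize(token))
        if used + n > budget:
            break
        kept.append(token)
        used += n
    return kept
```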

@nsaef nsaef closed this as completed Sep 11, 2019