This repository has been archived by the owner on Dec 16, 2022. It is now read-only.
Problem description and stack trace
I'm running Semantic Role Labelling, getting the model from this URL: "https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz".
When predicting a specific sentence, a RuntimeError is thrown: "index out of range at ..\aten\src\TH/generic/THTensorEvenMoreMath.cpp:193".
Here's the stack trace:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "\documents\allennlp\allennlp\predictors\semantic_role_labeler.py", line 180, in predict_batch_json
outputs.extend(self._model.forward_on_instances(batch))
File "\documents\allennlp\allennlp\models\model.py", line 153, in forward_on_instances
outputs = self.decode(self(**model_input))
File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "\documents\allennlp\allennlp\models\srl_bert.py", line 102, in forward
output_all_encoded_layers=False)
File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "\Anaconda\envs\st_win\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 730, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "\Anaconda\envs\st_win\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 268, in forward
position_embeddings = self.position_embeddings(position_ids)
File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\modules\sparse.py", line 117, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "\Anaconda\envs\st_win\lib\site-packages\torch\nn\functional.py", line 1506, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at ..\aten\src\TH/generic/THTensorEvenMoreMath.cpp:193
Tests and own research
When I cut the sentence off after 16 of the tokens containing (partial) chemical names, the predictor returns a result. It doesn't matter which 16 of those tokens are in the sentence; I can switch them around. When I don't cut the sentence off, the predictor throws the exception cited above. I've read bug reports in other repos where this error was linked to out-of-vocabulary tokens, which could be the case here (e.g. chenxijun1029/DeepFM_with_PyTorch#1).
The sentence contains 56 tokens, well below the ~500-token limit I've encountered with other inputs. It is quite long (3810 characters), but I've tested other sentences of similar length made up of normal tokens, and those worked.
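Back-of-the-envelope arithmetic on those figures already hints at the cause (the ~3-characters-per-wordpiece rate used below is an assumption, not a measured value):

```python
# Figures from the report: 56 tokens, 3810 characters in total.
chars, tokens = 3810, 56

avg_token_len = chars / tokens  # ~68 characters per token: far longer than normal words
print(f"average token length: {avg_token_len:.0f} characters")

# If out-of-vocabulary tokens get split into roughly 3-character wordpieces
# (an assumed rate), the wordpiece count dwarfs the ~500-token limit:
est_wordpieces = chars // 3
print(f"estimated wordpieces: {est_wordpieces}")  # well above 500
```

So even though the sentence has only 56 spaCy tokens, its wordpiece count could plausibly exceed the limit by a wide margin.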
Question
I can modify the data after each crash like this and re-run the program, but I'd rather identify sentences likely to fail programmatically and alter them beforehand. Do you know what causes the problem and how it can be fixed/prevented?
System:
Ubuntu 18 LTS / Windows 10
Python 3.7
AllenNLP version: 0.8.5
PyTorch version: 1.1.0
Hi! The SRL model has a limit of 512 wordpiece tokens, which means that typically you'll want your sentences to be shorter than about 350 tokens. However, this could break down in your case, as your tokens are disproportionately long.
This is the tokeniser that is used to go from tokens -> wordpieces:
Quite simply, OOV words are impossible if you use such a segmentation method. Any word which does not occur in the vocabulary will be broken down into subword units. Similarly, for rare words, given that the number of subword merges we used is limited, the word will not occur in the vocabulary, so it will be split into more frequent subwords.
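For illustration, here is a simplified sketch of the greedy longest-match-first segmentation that WordPiece performs. This is not the actual AllenNLP/BERT code, and the toy vocabulary is made up; it only shows how a word absent from the vocabulary gets broken into several subword units:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Find the longest prefix of the remaining characters in the vocab.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the "##" prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches at all
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary of short fragments (entirely made up for this example).
vocab = {"meth", "##yl", "##at", "##e"}

print(wordpiece_tokenize("methyl", vocab))     # ['meth', '##yl']
print(wordpiece_tokenize("methylate", vocab))  # ['meth', '##yl', '##at', '##e']
```

A 9-character chemical fragment already costs 4 wordpieces here, so a 68-character token can easily cost 20 or more, which is how 56 tokens can exceed 512 wordpieces.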
As all 49 of the chemical-name tokens will be out of vocabulary, the tokenizer likely splits them into lots of 2-5 character substrings that align with known words (such as "at", "is", "his", "alan", ...). When I split the text at each of the obvious words, I easily get more than 500 tokens.
I might implement a comparison against a dictionary to shorten long, unknown words, but in any case this likely won't be a frequent problem. Thanks for pointing me in the right direction!
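A rough pre-filter along those lines might look like the sketch below. All the names (`flag_risky`, `dictionary`, `chars_per_piece`) are hypothetical, and the 3-characters-per-wordpiece rate is an assumption; a more reliable check would be to run the model's own wordpiece tokenizer on the text and count the output directly.

```python
def flag_risky(sentences, dictionary, max_wordpieces=512, chars_per_piece=3):
    """Heuristically flag sentences whose unknown tokens could push the
    wordpiece count past the model limit (all thresholds are assumptions).

    Known words are estimated at ~1 wordpiece each; unknown words at
    roughly len(word) / chars_per_piece wordpieces.
    """
    risky = []
    for sent in sentences:
        estimate = 0
        for token in sent.split():
            if token.lower() in dictionary:
                estimate += 1
            else:
                estimate += max(1, len(token) // chars_per_piece)
        if estimate > max_wordpieces:
            risky.append(sent)
    return risky

# Hypothetical usage with a toy dictionary:
dictionary = {"the", "cat", "sat", "on", "mat"}
sentences = ["the cat sat on the mat", "the " + "x" * 2000]
print(flag_risky(sentences, dictionary))  # only the second sentence is flagged
```

Sentences flagged this way could then be shortened or split before being sent to the predictor, instead of fixing them after each crash.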