Allowed memory size exhausted on Font.php line 150 #735
Comments
Ultimately this is happening because the sample document has Font
... whereas PdfParser only extracts rows that match a regexp targeting the following format:
I think the solution is to treat rows that have square brackets as a direct string replacement rather than a numerical offset? But I am entirely unsure right now and will need to read up on the correct behaviour here.
Edit: If I alter the regexp in Font.php on line 220:
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>?[ \r\n]+/is';
... like so:
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>? *[\r\n]+/is';
... this makes PdfParser ignore all |
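As a rough illustration of the kind of false match discussed here, the sketch below runs the unmodified regexp quoted above against a made-up bfrange row in the square-bracket form. The row and its values are hypothetical and are not taken from the reported PDF; the point is only that destination strings inside the brackets can be read as a from/to/offset triple, which would then drive a very long loop.

```php
<?php
// Regexp as quoted above from Font.php (before any change).
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>?[ \r\n]+/is';

// Hypothetical bfrange row in the square-bracket form (made-up values).
// <00660066> stands for a two-character destination string such as "ff".
$row = "<0003> <0006> [<0041> <00660066> <0043> <0044>]\n";

preg_match_all($regexp, $row, $matches);

// The only match starts INSIDE the brackets: three destination strings
// are interpreted as <from> <to> <offset>.
$from = hexdec($matches['from'][0]);
$to = hexdec($matches['to'][0]);

// Number of iterations a "for ($char = $from; $char <= $to; ++$char)" loop
// would perform with these misread values.
$iterations = $to - $from + 1;

printf("from=%d to=%d iterations=%d\n", $from, $to, $iterations);
```

With this made-up row, a per-character loop over the misread range would run several million times, each iteration adding a table entry. Whether the reported file triggers exactly this pattern is an assumption.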
@k00ni any idea how such a small thing like |
No. |
It's giving the wrong numerical start and end values to a It's not the |
Note that the "fix" I provided in my previous post does NOT completely solve this issue; it should not be used as a workaround as it will cause errors in other files in the test suite. I noticed, after my posts, that the current The proper fix is below, replacing this entire

    // Support for multiple bfrange sections
    if (preg_match_all('/beginbfrange(?P<sections>.*?)endbfrange/s', $content, $matches)) {
        foreach ($matches['sections'] as $section) {
            // Support for : <srcCode1> <srcCodeN> [<dstString1> <dstString2> ... <dstStringN>]
            // Some PDF file has 2-byte Unicode values on new lines > added \r\n
            $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *\[(?P<strings>[\r\n<>0-9A-F ]+)\][ \r\n]+/is';
            preg_match_all($regexp, $section, $matches);
            foreach ($matches['from'] as $key => $from) {
                $char_from = hexdec($from);
                $strings = [];
                preg_match_all('/<(?P<string>[0-9A-F]+)> */is', $matches['strings'][$key], $strings);
                foreach ($strings['string'] as $position => $string) {
                    $parts = preg_split(
                        '/([0-9A-F]{4})/i',
                        $string,
                        0,
                        \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
                    );
                    $text = '';
                    foreach ($parts as $part) {
                        $text .= self::uchr(hexdec($part));
                    }
                    $this->table[$char_from + $position] = $text;
                }
                // Remove these found matches from the bfrange section
                // This prevents the regexp below from finding false matches
                $section = str_replace($matches[0][$key], '', $section);
            }
            // Support for : <srcCode1> <srcCode2> <dstString>
            $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>[ \r\n]+/is';
            preg_match_all($regexp, $section, $matches);
            foreach ($matches['from'] as $key => $from) {
                $char_from = hexdec($from);
                $char_to = hexdec($matches['to'][$key]);
                $offset = hexdec($matches['offset'][$key]);
                for ($char = $char_from; $char <= $char_to; ++$char) {
                    $this->table[$char] = self::uchr($char - $char_from + $offset);
                }
            }
        }
    }

The example PDF is a bit too large (466kB) to include in the test-suite I think. @UnnitMetaliya if you remember how you created this file (if you in fact created it) and are able to make a smaller version we can include in the test suite, would you be willing to do that for us? Otherwise, may we include the file you posted directly in the test suite? It's too complex of an issue to replicate with a mock document. Correctly declared font-tables are required. |
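For readers who want to try the two-pass idea outside the library, here is a minimal standalone sketch. The function name `parseBfrangeSection` and the sample section are hypothetical and not part of PdfParser. To stay self-contained it stores destination values as hex strings instead of calling `self::uchr()`, so it only shows which source code maps to which destination, not the final decoded text.

```php
<?php
// Hypothetical helper for illustration only; not PdfParser's API.
function parseBfrangeSection(string $section): array
{
    $table = [];

    // Pass 1: <from> <to> [<dst1> <dst2> ...]
    // Each listed destination belongs to the next consecutive source code.
    $arrayForm = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *\[(?P<strings>[\r\n<>0-9A-F ]+)\][ \r\n]+/is';
    preg_match_all($arrayForm, $section, $m);
    foreach ($m['from'] as $key => $from) {
        preg_match_all('/<([0-9A-F]+)>/i', $m['strings'][$key], $dst);
        foreach ($dst[1] as $position => $hex) {
            $table[hexdec($from) + $position] = strtoupper($hex);
        }
        // Drop the handled row so pass 2 cannot match inside the brackets.
        $section = str_replace($m[0][$key], '', $section);
    }

    // Pass 2: <from> <to> <offset>
    // Destinations are offset, offset + 1, ... across the source range.
    $rangeForm = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>[ \r\n]+/is';
    preg_match_all($rangeForm, $section, $m);
    foreach ($m['from'] as $key => $from) {
        $charFrom = hexdec($from);
        $charTo = hexdec($m['to'][$key]);
        $offset = hexdec($m['offset'][$key]);
        for ($char = $charFrom; $char <= $charTo; ++$char) {
            $table[$char] = sprintf('%04X', $char - $charFrom + $offset);
        }
    }

    return $table;
}

// Made-up section containing one row of each form.
$section = "\n<0003> <0005> [<0041> <0042> <0043>]\n<0010> <0012> <0061>\n";
$table = parseBfrangeSection($section);
print_r($table);
```

Removing the bracketed rows before the second pass is what keeps the range pattern from picking up destination strings as a from/to/offset triple.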
@GreyWyvern I did not create the test file but that pdf is publicly available. So, you can use it for sure. |
@k00ni do you think the fix from @GreyWyvern can go in as a permanent fix, if it can be considered one? |
@GreyWyvern thank you for taking the time here. His changes look reasonable at first glance. I suggest one creates a PR and we see how it goes. My time is currently very limited but I try to help when I can. |
I'm on vacation starting tomorrow and will be back on Monday next week. I can create the PR then. |
@GreyWyvern I can create it now if you want? |
You can do that, sure. Thanks! But I won't be around to review it until next week. :) |
Are you working on a PR for this, @UnnitMetaliya? I don't want to step on any toes. |
@GreyWyvern yes. I should be able to put it out there soon. Just busy with work. |
Fix copied from smalot#735. This is a temporary fork until the dependency is fixed.
PHP 8.3.10 (cli)
"smalot/pdfparser": "^2.11"
Description:
PDF input
unparseble.pdf
Expected output & actual output
Getting error:
PHP Fatal error: Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes) in ../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 150
Symfony\Component\ErrorHandler\Error\FatalError
Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)
at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150
146▕
147▕ if (!isset(self::$uchrCache[$code])) {
148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
149▕ // therefore, we use mb_convert_encoding() instead
➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES');
151▕ }
152▕
153▕ return self::$uchrCache[$code];
154▕ }
Whoops\Exception\ErrorException
Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)
at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150
146▕
147▕ if (!isset(self::$uchrCache[$code])) {
148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
149▕ // therefore, we use mb_convert_encoding() instead
➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES');
151▕ }
152▕
153▕ return self::$uchrCache[$code];
154▕ }
2 [internal]:0
Whoops\Run::handleShutdown()
Code
$this->pdfParser = new Parser();
$pdf = $this->pdfParser->parseFile($file);