Allowed memory size exhausted on Font.php line 150 #735

UnnitMetaliya · 2024-09-16T17:23:26Z

PHP Version: PHP 8.3.10 (cli)
PDFParser Version: smalot/pdfparser": "^2.11"

Description:

PDF input

Expected output & actual output

Getting error:

PHP Fatal error: Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes) in ../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 150

Symfony\Component\ErrorHandler\Error\FatalError

`Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)

at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150 146▕ 147▕ if (!isset(self::$uchrCache[$code])) { 148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities, 149▕ // therefore, we use mb_convert_encoding() instead ➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES'); 151▕ } 152▕ 153▕ return self::$uchrCache[$code]; 154▕ }

`Whoops\Exception\ErrorException

Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)

at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150
146▕
147▕ if (!isset(self::$uchrCache[$code])) {
148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
149▕ // therefore, we use mb_convert_encoding() instead
➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES');
151▕ }
152▕
153▕ return self::$uchrCache[$code];
154▕ }

  +1 vendor frames

2 [internal]:0
Whoops\Run::handleShutdown()

Code

$this->pdfParser = new Parser();
$pdf = $this->pdfParser->parseFile($file);

The text was updated successfully, but these errors were encountered:

GreyWyvern · 2024-09-16T18:03:13Z

Ultimately this is happening because the sample document has Font bfrange values such as:

<037F> <0389> [<FEDFFEE0FBAB> <FE80> <0622> <FE82> <0623> <FE84> <0624> <FE86> <0625> <FE88> <0626>]

... whereas PdfParser only extracts rows that match a regexp targeting the following format:

<037C> <037D> <FD3E>

I think the solution is to treat rows that have square brackets as a direct string replacement rather than a numerical offset? But I am entirely unsure right now and will need to read up on the correct behaviour here.

Edit: If I alter the regexp in Font.php on line 220:

$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>?[ \r\n]+/is';

... like so:

$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>? *[\r\n]+/is';

... this makes PdfParser ignore all bfrange lines that have square brackets and the document text will be output successfully. However, the content of the missing rows not being parsed probably means the document output text is missing some content.

UnnitMetaliya · 2024-09-17T05:26:24Z

@k00ni any idea how such a small thing like [ can cause memory leak issue?

k00ni · 2024-09-17T06:24:34Z

any idea how such a small thing like [ can cause memory leak issue?

No.

GreyWyvern · 2024-09-17T10:27:56Z

@k00ni any idea how such a small thing like [ can cause memory leak issue?

It's giving the wrong numerical start and end values to a for loop and it doesn't end. Simple as that.

It's not the [ specifically, it's that the regexp is matching values inside the square brackets and using character values incorrectly as offsets.

GreyWyvern · 2024-09-17T15:47:35Z

Note that the "fix" I provided in my previous post does NOT completely solve this issue; it should not be used as a workaround as it will cause errors in other files in the test suite.

I noticed, after my posts, that the current bfrange section in Font.php does account for the <0000> <0000> [<0000>] line format, but does so after it searches for regular <0000> <0000> <0000> style. The solution is to move the square-bracket format loop first, and then delete those lines from the section before allowing the regular-format loop to go at it. In this way, all items in the font table are properly filled and there is no infinite loop.

The proper fix is below, replacing this entire bfrange section:

// Support for multiple bfrange sections
if (preg_match_all('/beginbfrange(?P<sections>.*?)endbfrange/s', $content, $matches)) {
    foreach ($matches['sections'] as $section) {
        // Support for : <srcCode1> <srcCodeN> [<dstString1> <dstString2> ... <dstStringN>]
        // Some PDF file has 2-byte Unicode values on new lines > added \r\n
        $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *\[(?P<strings>[\r\n<>0-9A-F ]+)\][ \r\n]+/is';

        preg_match_all($regexp, $section, $matches);

        foreach ($matches['from'] as $key => $from) {
            $char_from = hexdec($from);
            $strings = [];

            preg_match_all('/<(?P<string>[0-9A-F]+)> */is', $matches['strings'][$key], $strings);

            foreach ($strings['string'] as $position => $string) {
                $parts = preg_split(
                    '/([0-9A-F]{4})/i',
                    $string,
                    0,
                    \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
                );
                $text = '';
                foreach ($parts as $part) {
                    $text .= self::uchr(hexdec($part));
                }
                $this->table[$char_from + $position] = $text;
            }

            // Remove these found matches from the bfrange section
            // This prevents the regexp below from finding false matches
            $section = str_replace($matches[0][$key], '', $section);
        }

        // Support for : <srcCode1> <srcCode2> <dstString>
        $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>[ \r\n]+/is';

        preg_match_all($regexp, $section, $matches);

        foreach ($matches['from'] as $key => $from) {
            $char_from = hexdec($from);
            $char_to = hexdec($matches['to'][$key]);
            $offset = hexdec($matches['offset'][$key]);

            for ($char = $char_from; $char <= $char_to; ++$char) {
                $this->table[$char] = self::uchr($char - $char_from + $offset);
            }
        }
    }
}

The example PDF is a bit too large (466kB) to include in the test-suite I think. @UnnitMetaliya if you remember how you created this file (if you in fact created it) and are able to make a smaller version we can include in the test suite, would you be willing to do that for us?

Otherwise, may we include the file you posted directly in the test suite? It's too complex of an issue to replicate with a mock document. Correctly declared font-tables are required.

UnnitMetaliya · 2024-09-17T19:46:41Z

@GreyWyvern I did not create the test file but that pdf is publicly available. So, you can use it for sure.

UnnitMetaliya · 2024-09-17T19:47:19Z

@k00ni do you think the fix @GreyWyvern can go in as permanent fix, if it can be considered as one?

k00ni · 2024-09-18T07:43:05Z

@k00ni do you think the fix @GreyWyvern can go in as permanent fix, if it can be considered as one?

@GreyWyvern thank you for taking the time here.

His changes look reasonable at first glance. I suggest one creates a PR and we see how it goes. My time is currently very limited but I try to help when I can.

GreyWyvern · 2024-09-18T13:33:33Z

I suggest one creates a PR and we see how it goes. My time is currently very limited but I try to help when I can.

I'm on vacation starting tomorrow and will be back on Monday next week. I can create the PR then.

UnnitMetaliya · 2024-09-18T13:34:49Z

@GreyWyvern I can create now if you want?

GreyWyvern · 2024-09-18T13:38:22Z

@GreyWyvern I can create now if you want?

You can do that, sure. Thanks! But I won't be around to review it until next week. :)

GreyWyvern · 2024-09-26T18:41:57Z

Are you working on a PR for this, @UnnitMetaliya? I don't want to step on any toes.

UnnitMetaliya · 2024-09-26T19:06:51Z

@GreyWyvern yes. I should be able to put it out there soon. Just busy with work.

Fix copied from smalot#735 This is temporary fork until the depenency is fixed

UnnitMetaliya mentioned this issue Sep 16, 2024

Exception not handling on "Allowed memory size exhausted" #519

Open

satrun77 added a commit to silverstripeltd/pdfparser that referenced this issue Nov 12, 2024

Add memory timeout fix

11ad2cc

Fix copied from smalot#735 This is temporary fork until the depenency is fixed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allowed memory size exhausted on Font.php line 150 #735

Allowed memory size exhausted on Font.php line 150 #735

UnnitMetaliya commented Sep 16, 2024

GreyWyvern commented Sep 16, 2024 •

edited

Loading

UnnitMetaliya commented Sep 17, 2024

k00ni commented Sep 17, 2024

GreyWyvern commented Sep 17, 2024

GreyWyvern commented Sep 17, 2024

UnnitMetaliya commented Sep 17, 2024

UnnitMetaliya commented Sep 17, 2024

k00ni commented Sep 18, 2024

GreyWyvern commented Sep 18, 2024

UnnitMetaliya commented Sep 18, 2024

GreyWyvern commented Sep 18, 2024

GreyWyvern commented Sep 26, 2024

UnnitMetaliya commented Sep 26, 2024

Allowed memory size exhausted on Font.php line 150 #735

Allowed memory size exhausted on Font.php line 150 #735

Comments

UnnitMetaliya commented Sep 16, 2024

Description:

PDF input

Expected output & actual output

Code

GreyWyvern commented Sep 16, 2024 • edited Loading

UnnitMetaliya commented Sep 17, 2024

k00ni commented Sep 17, 2024

GreyWyvern commented Sep 17, 2024

GreyWyvern commented Sep 17, 2024

UnnitMetaliya commented Sep 17, 2024

UnnitMetaliya commented Sep 17, 2024

k00ni commented Sep 18, 2024

GreyWyvern commented Sep 18, 2024

UnnitMetaliya commented Sep 18, 2024

GreyWyvern commented Sep 18, 2024

GreyWyvern commented Sep 26, 2024

UnnitMetaliya commented Sep 26, 2024

GreyWyvern commented Sep 16, 2024 •

edited

Loading