getDataTm() provides wrong coordinates for text blocks #733

Open
parpalak opened this issue Sep 2, 2024 · 1 comment

parpalak commented Sep 2, 2024

I found an issue with the getDataTm() method in version 2.11. In some cases, the result contains text from a neighboring block instead of the block at the reported coordinates. The cause is that the PDFObject::getTextArray() method also returns text for "Do" commands, taken from the XObject that the command references:

$text[] = $xobject->getText($page);

Then, inside the getDataTm() method, strings from PDFObject::getTextArray() are matched with commands returned by the Page::getDataCommands() method:

$extractedTexts = $this->getTextArray();

$dataCommands = $this->getDataCommands();

However, the latter does not return the "Do" command, so there are more elements in PDFObject::getTextArray() than in Page::getDataCommands(), leading to a mismatch.
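The effect of that length mismatch can be sketched without the library. This is only a toy model: the `$stream`, `$texts`, `$commands`, and `$paired` arrays are hypothetical stand-ins and not the real return values of getTextArray() or getDataCommands(). It assumes the two lists are paired by position, which is how the mismatch is described above.

```php
<?php
// Toy content stream in drawing order. 'Do' draws an XObject; here it is
// assumed to contribute only a single blank to the extracted text.
$stream = [
    ['op' => 'Tj', 'text' => 'Header'],
    ['op' => 'Do', 'text' => ' '],
    ['op' => 'Tj', 'text' => 'Body'],
];

// Text list: one entry per Tj *and* per Do.
$texts = array_column($stream, 'text');

// Command list: the Do entry is left out, so it is one element shorter.
$commands = array_values(array_filter(
    $stream,
    fn (array $entry): bool => 'Do' !== $entry['op']
));

// Pair the two lists by position.
$paired = [];
foreach ($commands as $i => $command) {
    $paired[] = ['expected' => $command['text'], 'got' => $texts[$i]];
}

// From the first Do onward, every command receives the text of an earlier
// entry, and the trailing text ('Body') is never assigned to any command.
foreach ($paired as $pair) {
    printf("expected %s, got %s\n", json_encode($pair['expected']), json_encode($pair['got']));
}
```

In this model the second command, which should carry "Body", is paired with the blank from the XObject, and "Body" itself is dropped.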

Unfortunately, I cannot provide a minimal PDF example. The files I have to parse are too large, and I don't know how they were generated. In my case, commenting out $text[] = $xobject->getText($page); helped. Since I'm not sure what the original intent of handling "Do" was, I cannot suggest a pull request that would fix this issue.

k00ni added the bug label Sep 3, 2024
DominikDostal (Contributor) commented

I also had this problem, and made a workaround for myself in this if:

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {

I changed it from

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    $text[] = $xobject->getText($page);
}

to

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.

    // Only add the text if the XObject actually produced any. Otherwise the number of
    // texts and the number of TJ/Tj commands don't match, and the last texts are ignored.
    $newText = $xobject->getText($page);
    if ($newText === ' ') {
        break;
    }
    $text[] = $newText;
}

I didn't create a PR because I wasn't 100% sure whether this is the correct fix or just a dirty workaround. But maybe this can help someone with the same problem.
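What this workaround covers can be sketched in a toy model. The `collectTexts()` helper and the sample arrays below are hypothetical and do not use the library. The sketch assumes, as in the workaround, that an XObject without text yields a single space. It only shows that skipping blank entries restores the count in that case, and that an XObject producing real text would still add an extra entry.

```php
<?php
// Hypothetical helper: builds the text list from toy stream entries and skips
// 'Do' entries whose text is a single space (the condition used in the workaround).
function collectTexts(array $stream): array
{
    $texts = [];
    foreach ($stream as $entry) {
        if ('Do' === $entry['op'] && ' ' === $entry['text']) {
            continue; // nothing to add for an XObject without text
        }
        $texts[] = $entry['text'];
    }

    return $texts;
}

// Case 1: the XObject contributes only a blank.
$blankXObject = [
    ['op' => 'Tj', 'text' => 'Header'],
    ['op' => 'Do', 'text' => ' '],
    ['op' => 'Tj', 'text' => 'Body'],
];

// Case 2: the XObject contributes real text (made-up sample value).
$textXObject = [
    ['op' => 'Tj', 'text' => 'Header'],
    ['op' => 'Do', 'text' => 'Logo caption'],
    ['op' => 'Tj', 'text' => 'Body'],
];

$alignedTexts = collectTexts($blankXObject);  // two texts for two Tj commands
$stillOffTexts = collectTexts($textXObject);  // three texts for two Tj commands

var_dump(count($alignedTexts), count($stillOffTexts));
```

Whether the second case occurs in practice, and how it should be matched to coordinates, is not settled in this thread.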
