page.to_image() PDFium: Data format error #1179

Open
dalinautoagents opened this issue Jul 30, 2024 · 4 comments
Labels: awaiting-code-or-pdf (issues and PRs awaiting code and/or a PDF from the issue/PR author), bug

Comments

dalinautoagents commented Jul 30, 2024

Describe the bug

A clear and concise description of what the bug is.

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.

Code to reproduce the problem

Paste it here, or attach a Python file.

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

What did you expect the result should have been?

Actual behavior

What actually happened, instead?

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

  • pdfplumber==0.10.4
  • Python version: [e.g., 3.11.0]
  • OS: docker

Additional context

Add any other context/notes about the problem here.

It's easy to reproduce: take two large PDFs and run this code on both at the same time:

    self.pdf = (pdfplumber.open(fnm) if isinstance(fnm, str)
                else pdfplumber.open(BytesIO(fnm)))

    self.page_images = [p.to_image(resolution=72 * zoomin).annotated
                        for p in self.pdf.pages[page_from:page_to]]

I think there's a concurrency issue with 'to_image'.

Update: when I add a lock around this code, it works.

jsvine (Owner) commented Jul 31, 2024

Thank you for raising this issue. Please try updating to the latest version of pdfplumber. Do you still encounter the problem? If so, can you share a fully-reproducible script?

dalinautoagents (Author) commented Jul 31, 2024

@jwilk
I updated to pdfplumber==0.11.1.

Without a lock, when I run this on two large files at the same time, I still get the error from this code:

    try:
        self.pdf = (pdfplumber.open(fnm) if isinstance(fnm, str)
                    else pdfplumber.open(BytesIO(fnm)))
        self.page_images = [p.to_image(resolution=72 * zoomin).annotated
                            for p in self.pdf.pages[page_from:page_to]]
        self.page_chars = [[c for c in page.chars if self._has_color(c)]
                           for page in self.pdf.pages[page_from:page_to]]
        self.total_page = len(self.pdf.pages)
    except Exception as e:
        traceback.print_exc()
        logging.error(str(e))

After I add a lock, it works:

    with lock:  # released automatically, even if an exception is raised
        try:
            self.pdf = (pdfplumber.open(fnm) if isinstance(fnm, str)
                        else pdfplumber.open(BytesIO(fnm)))
            self.page_images = [p.to_image(resolution=72 * zoomin).annotated
                                for p in self.pdf.pages[page_from:page_to]]
            self.page_chars = [[c for c in page.chars if self._has_color(c)]
                               for page in self.pdf.pages[page_from:page_to]]
            self.total_page = len(self.pdf.pages)
        except Exception as e:
            traceback.print_exc()
            logging.error(str(e))

But with the lock, throughput is too low.
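One possible way to reduce the cost of the lock, sketched under assumptions the thread does not confirm: the error message names PDFium, and the pypdfium2 documentation describes PDFium as not thread-safe, so it may be enough to serialize only the `to_image` calls and leave opening the file and reading `page.chars` outside the lock. The names `RENDER_LOCK` and `render_pages` are hypothetical helpers, not part of pdfplumber.

```python
import threading

# Hypothetical process-wide lock shared by every thread that renders pages.
RENDER_LOCK = threading.Lock()


def render_pages(pages, resolution, lock=RENDER_LOCK):
    """Render each page to an image, holding the lock only while rendering.

    `pages` is any iterable of objects with a pdfplumber-style
    `to_image(resolution=...)` method whose result has an `.annotated` image.
    """
    images = []
    for page in pages:
        # Assumption: only the PDFium-backed rendering needs to be serialized.
        with lock:
            images.append(page.to_image(resolution=resolution).annotated)
    return images


# Possible use inside the method from the issue (names taken from the snippet):
#     self.page_images = render_pages(
#         self.pdf.pages[page_from:page_to], resolution=72 * zoomin)
```

Taking the lock per page rather than per document lets other threads interleave between pages. Whether this recovers enough throughput, or whether other calls also need the lock, would have to be measured.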

dalinautoagents (Author) commented

@jsvine Could you give me some ideas for fixing this? I don't know what I can do to improve the efficiency.
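If the failures do come from threads sharing PDFium (an assumption; nobody in this thread has confirmed the cause), one alternative to a lock is to render in separate processes, each opening its own copy of the PDF. The sketch below assumes the PDF is available as a file path; `split_page_ranges`, `render_range`, and `render_pdf_parallel` are hypothetical names. Workers return PNG bytes so that only plain bytes cross the process boundary.

```python
from concurrent.futures import ProcessPoolExecutor
from io import BytesIO


def split_page_ranges(total_pages, n_chunks):
    """Split range(total_pages) into up to n_chunks contiguous (start, stop) pairs."""
    if total_pages <= 0:
        return []
    n_chunks = max(1, min(n_chunks, total_pages))
    base, extra = divmod(total_pages, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional page each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def render_range(job):
    """Worker: open the PDF in this process and render one page range to PNG bytes."""
    import pdfplumber  # imported in the worker so each process has its own state

    path, start, stop, resolution = job
    pngs = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages[start:stop]:
            buf = BytesIO()
            page.to_image(resolution=resolution).annotated.save(buf, format="PNG")
            pngs.append(buf.getvalue())
    return pngs


def render_pdf_parallel(path, total_pages, resolution=72, workers=2):
    """Render all pages using `workers` processes; returns PNG bytes in page order."""
    jobs = [(path, a, b, resolution)
            for a, b in split_page_ranges(total_pages, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [png for chunk in pool.map(render_range, jobs) for png in chunk]
```

On platforms that start worker processes by spawning, the call to `render_pdf_parallel` should sit under an `if __name__ == "__main__":` guard. Process startup and image transfer add overhead, so this is only likely to pay off for large documents.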

jsvine (Owner) commented Aug 2, 2024

Hi @dalinautoagents, those code snippets reference external unstated variables and also combine image-related processing with other logic, creating an obstacle to reproduction. Could you create a simplified Python script that can be run directly and reproduces the error you're seeing?
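A standalone script along the following lines might be a starting point for the reproduction requested above. It is only a sketch: it assumes the original code ran in threads within one process (the thread does not say), the file names are placeholders, and `run_concurrently` and `render_all_pages` are hypothetical helpers.

```python
import threading


def run_concurrently(func, arg_list):
    """Run func(arg) in one thread per arg; return [(arg, exception)] for failures."""
    errors = []
    errors_lock = threading.Lock()

    def target(arg):
        try:
            func(arg)
        except Exception as exc:  # collect instead of losing the traceback
            with errors_lock:
                errors.append((arg, exc))

    threads = [threading.Thread(target=target, args=(arg,)) for arg in arg_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def render_all_pages(path, resolution=144):
    """Open one PDF and render every page; nothing but image rendering is done."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page.to_image(resolution=resolution)


# Usage (placeholder file names; substitute the two large PDFs):
#     for path, exc in run_concurrently(render_all_pages, ["a.pdf", "b.pdf"]):
#         print(path, repr(exc))
```

Keeping the script to `to_image` alone separates rendering from the character extraction and class state in the earlier snippets.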

jsvine added the awaiting-code-or-pdf label Aug 7, 2024