How to get the page number of each figure? #75

Open
LiyingCheng95 opened this issue Mar 20, 2024 · 6 comments

Comments

@LiyingCheng95

I want to crop all the figures/images/tables in one PDF. Can I get the page number of each figure in doc.figures[x]?

@kyleclo
Collaborator

kyleclo commented Mar 20, 2024

hi @LiyingCheng95

please check out this example snippet in #63

import json
import os
import pathlib

from papermage.magelib import Document
from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2020.acl-main.447.pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize figures on a page
page_id = 0
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# get the image of a page and its dimensions
page_image = doc.images[page_id]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
page_image._pilimage.crop(figure_box_xy)
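
To make the "convert it" step above concrete, here is a minimal sketch of the arithmetic involved. It assumes a box stores (l, t, w, h) as fractions of the page size. The helper name relative_box_to_pixels is hypothetical, and this is not papermage's to_absolute implementation, only an illustration of the idea.

```python
def relative_box_to_pixels(l, t, w, h, page_width, page_height):
    """Map a relative (l, t, w, h) box to a (left, upper, right, lower) pixel tuple.

    Assumption: l, t, w, h are fractions of the page size in [0, 1].
    The returned tuple is in the order PIL's Image.crop expects.
    """
    left = round(l * page_width)
    upper = round(t * page_height)
    right = round((l + w) * page_width)
    lower = round((t + h) * page_height)
    return (left, upper, right, lower)


# Illustrative values only, not taken from any real PDF.
pixel_box = relative_box_to_pixels(0.25, 0.5, 0.5, 0.25, page_width=800, page_height=1000)
print(pixel_box)  # (200, 500, 600, 750)
```

The resulting tuple can be passed straight to a PIL image's crop method, which takes (left, upper, right, lower).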

@LiyingCheng95
Author

Thanks for your prompt reply. However, it doesn't work in my case. For example, my PDF has a figure on page 8. The code below crops that figure correctly, but only because I hard-code the page index; I would have to specify the page for every figure detected in the file.

recipe = CoreRecipe()
doc = recipe.run("path to my pdf")

# get the image of a page and its dimensions
page_image = doc.images[8]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = doc.figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
cropped_image = page_image._pilimage.crop(figure_box_xy)

cropped_image.save('cropped_image.jpg')

But when I ran the code below, the line figure_box = figures[0].boxes[0] raised "IndexError: list index out of range":

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent / "path to my pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize figures on a page
page_id = 8
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# get the image of a page and its dimensions
page_image = doc.images[page_id]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
cropped_image = page_image._pilimage.crop(figure_box_xy)
cropped_image.save('cropped_image.jpg')

Not sure what's wrong there?

@kyleclo
Collaborator

kyleclo commented Mar 21, 2024

Do you mind emailing the PDF file?

@kyleclo
Collaborator

kyleclo commented Mar 21, 2024

Thanks @LiyingCheng95 this is definitely a bug; I'm looking into patching it!

First, it seems like the figure is actually being detected correctly. For example:

recipe = CoreRecipe()
doc = recipe.from_pdf(pdf='your-file.pdf')

# asserts there are definitely figures on page 8
figures = [figure for figure in doc.figures if figure.boxes[0].page == 8]
assert len(figures) > 0
print(f"{figures[0].boxes}")

> [Box[0.12299907267594538, 0.05627375260667959, 0.731138803177521, 0.19940743706854958, 8]]

# i can visualize that figure on page 8
plot_entities_on_page(page_image=doc.images[8], entities=figures)

[image: the detected figure plotted on page 8]

So I looked into where the bug is coming from. It seems to come from this cross-layer indexing operation, which is not finding a match:

figures[0].intersect_by_box("pages")
> []

doc.pages[8].intersect_by_box("figures")
> []

This is super weird because the boxes definitely overlap:

doc.pages[8].boxes[0]
> Box[0.027564877832563207, 0.2701246785544094, 0.943916833476601, 0.523800017428919, 8]

figures[0].boxes[0]
> Box[0.12299907267594538, 0.05627375260667959, 0.731138803177521, 0.19940743706854958, 8]

So I checked and it looks like there's a bug in my Box.is_overlap logic:

figures[0].boxes[0].is_overlap(doc.pages[8].boxes[0])
> False
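
For reference, a textbook axis-aligned overlap test can be sketched as follows. This is not papermage's Box.is_overlap and says nothing about where its bug lies; RelBox and boxes_overlap are hypothetical names, and the sketch assumes each box is (l, t, w, h, page) with page-relative coordinates. Boxes that merely touch along an edge count as overlapping here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelBox:
    """Hypothetical stand-in for a page-relative box: left, top, width, height, page."""
    l: float
    t: float
    w: float
    h: float
    page: int


def boxes_overlap(a: RelBox, b: RelBox) -> bool:
    """Return True if both boxes are on the same page and their rectangles intersect."""
    if a.page != b.page:
        return False
    # Two intervals intersect when each one starts before the other ends.
    overlap_x = a.l <= b.l + b.w and b.l <= a.l + a.w
    overlap_y = a.t <= b.t + b.h and b.t <= a.t + a.h
    return overlap_x and overlap_y


# Illustrative boxes, not taken from the thread.
page_box = RelBox(0.0, 0.0, 1.0, 1.0, page=8)
inner = RelBox(0.1, 0.1, 0.5, 0.2, page=8)
other_page = RelBox(0.1, 0.1, 0.5, 0.2, page=7)
print(boxes_overlap(inner, page_box))       # True
print(boxes_overlap(other_page, page_box))  # False
```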

I'll work on fixing this.

In the meantime, you should be able to grab all the figures using doc.figures. To keep only the figures on a given page, filter on the page stored in each figure's box, e.g. [figure for figure in doc.figures if figure.boxes[0].page == ??].
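
Putting that workaround together with the cropping snippet from earlier in the thread, a minimal sketch might look like the following. It uses only attributes that appear in this thread (doc.figures, boxes[0].page, doc.images, pilimage, to_absolute, xy_coordinates). crop_figures is a hypothetical helper, not part of papermage, and it assumes each figure has exactly one box.

```python
def crop_figures(doc):
    """Crop every detected figure from the image of the page it sits on.

    Returns a list of (page_index, cropped_image) pairs.
    Hypothetical helper; relies only on attributes shown in this thread.
    """
    crops = []
    for figure in doc.figures:
        box = figure.boxes[0]        # assumption: one box per figure
        page_index = box.page        # page the box was detected on
        page_image = doc.images[page_index].pilimage
        page_w, page_h = page_image.size
        xy = box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates
        crops.append((page_index, page_image.crop(xy)))
    return crops


# Usage sketch (requires papermage and a real PDF, so it is left commented out):
# doc = CoreRecipe().from_pdf(pdf="your-file.pdf")
# for i, (page_index, image) in enumerate(crop_figures(doc)):
#     image.save(f"figure_{i}_page_{page_index}.png")
```

Because the page index is read from the box itself, this avoids the intersect_by_box lookup that returns an empty list above.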

@xsank

xsank commented Jul 30, 2024

You could use the layout parser directly to parse figures page by page.

@sssyaDavid

Has this problem been solved? I think I'm running into the same problem here.
