problems when parsing older paper in PDF format #61

Open
SherryPan0 opened this issue Dec 15, 2023 · 2 comments

@SherryPan0

Hi, thanks for this great toolkit!
I tried papermage on several PDF files. It works really well on recent papers, but when I tried to parse papers published in 1980 and 1989, sentence segmentation failed: almost every "sentence" in `doc.sentences` is a single word or a short fragment.

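# Setup assumed here (not shown in the original report); it follows the papermage README.
from papermage.recipes import CoreRecipe

recipe = CoreRecipe()
doc = recipe.run("1980.pdf")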
for sen in doc.sentences:
    print(sen.text)
'''
output:
Received
January
1978;
revised
October
1979;
accepted
December 1979
References
1.
Avery,
K.
R.
,
and
Avery,
C.
A.
Design
and
development
of an interactive
statistical
system
(SIPS).
Proc.
Comptr.
Sci.
and
Statistics: 8th
Ann.
Symp.
on
'''
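
To put a number on the fragmentation shown above, here is a minimal sketch. The helper `fragmentation_stats` and its `max_words` threshold are hypothetical and not part of papermage; it only needs a list of sentence strings, for example `[sen.text for sen in doc.sentences]`.

```python
from statistics import mean


def fragmentation_stats(sentences, max_words=2):
    """Summarize how fragmented a list of sentence strings is.

    Hypothetical helper for illustration; not part of papermage.
    """
    # Count whitespace-separated words in each non-empty "sentence".
    lengths = [len(s.split()) for s in sentences if s.strip()]
    if not lengths:
        return {"count": 0, "mean_words": 0.0, "short_fraction": 0.0}
    # A "short" sentence has at most max_words words.
    short = sum(1 for n in lengths if n <= max_words)
    return {
        "count": len(lengths),
        "mean_words": mean(lengths),
        "short_fraction": short / len(lengths),
    }


# A few fragments copied from the output above.
sample = ["Received", "January", "1978;", "revised", "December 1979", "of an interactive"]
stats = fragmentation_stats(sample)
print(stats)
```

A mean close to one word per sentence, as in this sample, would match the behavior reported here; correctly segmented text would be expected to give a much higher mean.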
@kyleclo
Collaborator

kyleclo commented Dec 19, 2023

Interesting! Could you send me the PDF so I can take a look? Older PDFs aren't something we've investigated much.

@SherryPan0
Author

1980.pdf
1989.pdf
These are the two PDF files that I have tested. Thanks!
