Couldn't open file with special characters in filename #529

Open
endolith opened this issue Sep 7, 2024 · 0 comments
Describe the bug
On Windows, textract fails to open files whose names contain non-ASCII (Unicode) characters.

To Reproduce

In Windows 10:

import textract
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting.pdf")
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")

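The first filename opens fine, but the second fails because of the non-ASCII character "ć". Note that in the stderr output, pdftotext reports the filename as "Volic.pdf", with "ć" replaced by "c":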

In [7]: textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")
---------------------------------------------------------------------------
ShellError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\__init__.py:79, in process(filename, input_encoding, output_encoding, extension, **kwargs)
     76 # do the extraction
     78 parser = filetype_module.Parser()
---> 79 return parser.process(filename, input_encoding, output_encoding, **kwargs)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:46, in BaseParser.process(self, filename, input_encoding, output_encoding, **kwargs)
     36 """Process ``filename`` and encode byte-string with ``encoding``. This
     37 method is called by :func:`textract.parsers.process` and wraps
     38 the :meth:`.BaseParser.extract` method in `a delicious unicode
     39 sandwich <http://nedbatchelder.com/text/unipain.html>`_.
     40
     41 """
     42 # make a "unicode sandwich" to handle dealing with unknown
     43 # input byte strings and converting them to a predictable
     44 # output encoding
     45 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46 byte_string = self.extract(filename, **kwargs)
     47 unicode_string = self.decode(byte_string, input_encoding)
     48 return self.encode(unicode_string, output_encoding)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:29, in Parser.extract(self, filename, method, **kwargs)
     27             return self.extract_pdfminer(filename, **kwargs)
     28         else:
---> 29             raise ex
     31 elif method == 'pdfminer':
     32     return self.extract_pdfminer(filename, **kwargs)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:21, in Parser.extract(self, filename, method, **kwargs)
     19 if method == '' or method == 'pdftotext':
     20     try:
---> 21         return self.extract_pdftotext(filename, **kwargs)
     22     except ShellError as ex:
     23         # If pdftotext isn't installed and the pdftotext method
     24         # wasn't specified, then gracefully fallback to using
     25         # pdfminer instead.
     26         if method == '' and ex.is_not_installed():

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:44, in Parser.extract_pdftotext(self, filename, **kwargs)
     42 else:
     43     args = ['pdftotext', filename, '-']
---> 44 stdout, _ = self.run(args)
     45 return stdout

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:106, in ShellParser.run(self, args)
    104 # if pipe is busted, raise an error (unlike Fabric)
    105 if pipe.returncode != 0:
--> 106     raise exceptions.ShellError(
    107         ' '.join(args), pipe.returncode, stdout, stderr,
    108     )
    110 return stdout, stderr

ShellError: The command `pdftotext Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf -` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b"Error: Couldn't open file 'Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volic.pdf'\r\n"

Expected behavior
It should extract the text from the files.
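
A possible workaround until this is fixed, offered as a sketch rather than a confirmed fix: since the stderr above shows the filename reaching pdftotext with "ć" turned into "c", copying the file to a temporary ASCII-only name before extraction may sidestep the problem. The helper name `process_via_ascii_copy` is hypothetical (not part of textract), and it assumes the temp directory path itself is ASCII, which may not hold if the Windows username contains non-ASCII characters.

```python
import os
import shutil
import tempfile


def process_via_ascii_copy(path, extract):
    """Copy `path` to a temporary ASCII-only filename and run `extract` on the copy.

    `extract` is any callable that takes a filename, e.g. `textract.process`.
    Hypothetical helper, not part of textract.
    """
    # Keep the extension, since textract chooses its parser from it,
    # but drop it if it is itself non-ASCII.
    suffix = os.path.splitext(path)[1]
    if not suffix.isascii():
        suffix = ""

    tmp_dir = tempfile.mkdtemp(prefix="textract_")
    try:
        tmp_path = os.path.join(tmp_dir, "input" + suffix)
        shutil.copyfile(path, tmp_path)
        return extract(tmp_path)
    finally:
        # Always clean up the temporary copy, even if extraction fails.
        shutil.rmtree(tmp_dir, ignore_errors=True)


# Usage (assumes textract is installed):
#   import textract
#   text = process_via_ascii_copy(
#       r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf",
#       textract.process,
#   )
```

This only avoids the symptom; it does not address how the filename is handed to the external pdftotext command.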

Desktop:

  • OS: Windows 10
  • Textract version: 1.6.5
  • Python version: 3.12.4
  • Virtual environment: yes, in conda