Couldn't open file
Describe the bug
On Windows, textract fails to open files whose names contain non-ASCII (Unicode) characters.
To Reproduce
In Windows 10:
import textract
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting.pdf")
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")
The first filename opens fine, but the second fails because of the non-ASCII character (ć) in the name:
In [7]: textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")
---------------------------------------------------------------------------
ShellError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\__init__.py:79, in process(filename, input_encoding, output_encoding, extension, **kwargs)
     76 # do the extraction
     78 parser = filetype_module.Parser()
---> 79 return parser.process(filename, input_encoding, output_encoding, **kwargs)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:46, in BaseParser.process(self, filename, input_encoding, output_encoding, **kwargs)
     36 """Process ``filename`` and encode byte-string with ``encoding``. This
     37 method is called by :func:`textract.parsers.process` and wraps
     38 the :meth:`.BaseParser.extract` method in `a delicious unicode
     39 sandwich <http://nedbatchelder.com/text/unipain.html>`_.
     40
     41 """
     42 # make a "unicode sandwich" to handle dealing with unknown
     43 # input byte strings and converting them to a predictable
     44 # output encoding
     45 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46 byte_string = self.extract(filename, **kwargs)
     47 unicode_string = self.decode(byte_string, input_encoding)
     48 return self.encode(unicode_string, output_encoding)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:29, in Parser.extract(self, filename, method, **kwargs)
     27     return self.extract_pdfminer(filename, **kwargs)
     28 else:
---> 29     raise ex
     31 elif method == 'pdfminer':
     32     return self.extract_pdfminer(filename, **kwargs)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:21, in Parser.extract(self, filename, method, **kwargs)
     19 if method == '' or method == 'pdftotext':
     20     try:
---> 21         return self.extract_pdftotext(filename, **kwargs)
     22     except ShellError as ex:
     23         # If pdftotext isn't installed and the pdftotext method
     24         # wasn't specified, then gracefully fallback to using
     25         # pdfminer instead.
     26         if method == '' and ex.is_not_installed():

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:44, in Parser.extract_pdftotext(self, filename, **kwargs)
     42 else:
     43     args = ['pdftotext', filename, '-']
---> 44 stdout, _ = self.run(args)
     45 return stdout

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:106, in ShellParser.run(self, args)
    104 # if pipe is busted, raise an error (unlike Fabric)
    105 if pipe.returncode != 0:
--> 106     raise exceptions.ShellError(
    107         ' '.join(args), pipe.returncode, stdout, stderr,
    108     )
    110 return stdout, stderr

ShellError: The command `pdftotext Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf -` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b"Error: Couldn't open file 'Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volic.pdf'\r\n"
Expected behavior
It should extract the text from the files.
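Possible workaround (a sketch, not a fix in textract itself): the stderr output in the traceback shows pdftotext being asked for '... Ismar Volic.pdf' rather than '... Ismar Volić.pdf', which suggests the name is altered somewhere between Python and the external command. Under that assumption, copying the file to a temporary ASCII-only name before extraction might sidestep the problem. The helper name `process_via_ascii_copy` and the temporary name `input` are hypothetical, and this has not been verified against textract on Windows. It also assumes the temporary directory path and the file extension are themselves ASCII.

```python
import os
import shutil
import tempfile


def process_via_ascii_copy(path, process):
    """Copy `path` to a temporary ASCII-only filename and run `process` on the copy.

    `process` is any callable that takes a file path, for example
    textract.process. Hypothetical helper for illustration only.
    """
    # Keep the original extension so extension-based parser lookup still works.
    # Assumption: the extension itself contains only ASCII characters.
    ext = os.path.splitext(path)[1]
    with tempfile.TemporaryDirectory() as tmp_dir:
        # "input" is an arbitrary ASCII-only placeholder name.
        safe_path = os.path.join(tmp_dir, "input" + ext)
        shutil.copyfile(path, safe_path)
        # The temporary directory and the copy are removed once this returns.
        return process(safe_path)
```

It would be called as `process_via_ascii_copy(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf", textract.process)`. The copy costs extra disk I/O for large files, so it may be worth applying only when `os.path.basename(path).isascii()` is false.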
Desktop (please complete the following information):