Support for Office-Formats with Apache-Tika #600
Hey, I wonder if it would make sense to convert those files to PDF in the consumer. Having all documents stored as a PDF file has some advantages IMO:
What do you think?
I see your point. Let me describe my use-case in more detail: I have an office document for, let's say, cancelling an insurance policy. I convert it to PDF, archive it in paperless and send the PDF via e-mail to the insurance company. The next time I have a similar matter, I want to grab the original office document from paperless, change the details and send it as PDF to the next insurance company. So the office documents function as templates of a sort. At the moment, I store these templates on a separate NFS share. Probably the use-case is really specific to me and the NFS share solution might be sufficient. However, I don't like having different ways of accessing my documents. Maybe it's not worth the effort. I should have asked beforehand.
Ah! That explains the problems with office documents and the font in the container. It was not necessary to provide them until now.
Just commenting on "1. CORS_ORIGIN_WHITELIST issue"
While I see your use-case, I don't see tika being a part of paperless. It's one thing to run a couple of binaries to parse a file, but it's imho another to run a Java-based webserver next to a small app like paperless for this purpose. I'd rather see this implemented in another project, as a django app people can include if they want to - maybe even under the hood of https://github.com/the-paperless-project. That being said, some of your already proposed changes - like supporting more document types in the model - seem to be necessary. Others, like the tika dependency and the additional django module for That's my 2 cents.
@Tooa Hi there. I've been working on a fork of paperless for a while now. If you're still around and want to make this happen, maybe we can work something out. Contact me if you're interested in contributing and we can discuss the details.
Motivation and Description
This pull request adds support for office formats (such as odt, ods, docx, etc.) in paperless. In order to process these files and extract their content, Apache-Tika is applied:
I started implementing this feature because I have office files stored on NFS not being part of my paperless instance. My goal is to have my personal documents in one place accessible via full-text search. This place should be paperless!
This is a work in progress PR as I would like to receive initial feedback from the community before I start polishing it towards merging. Especially, I would like to know:
paperless?
This is a first working prototype - the feature is not complete and still under development (see open tasks).
In the short term, I would like to handle ms-office and open-office formats with this parser. In the long term, however, Apache-Tika might replace even the PDF/OCR parser since Tika also comes with support for tesseract. I haven't decided on the latter yet and the change will definitely not be part of this PR.
Open Questions
I had to make two changes in order to get the original master working. I wonder why nobody else has problems with the docker setup of the current master.
1. CORS_ORIGIN_WHITELIST issue
CORS_ORIGIN_WHITELIST:
Otherwise, I get the following error:
2. Missing Fonts Issue
msttcorefonts-installer and fontconfig to the Dockerfile. Otherwise, I get the following error:
master? If not, I'm fine to drop the change as it seems that it is a particular problem with my local machine setup.
RUN command.
Open Tasks
TikaDocumentParser
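For readers following the discussion, here is a minimal sketch of how a parser such as TikaDocumentParser could obtain text from a running Tika server. It is not the code in this PR. It assumes a Tika server listening on its default port 9998 and uses the server's `PUT /tika` endpoint with `Accept: text/plain`; the names `TIKA_URL`, `build_tika_request` and `extract_text` are made up for illustration.

```python
import urllib.request

# Assumption: a Tika server is reachable here (9998 is its default port).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika for the plain-text content of a document."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={
            # Ask for extracted text rather than XHTML.
            "Accept": "text/plain",
            # Generic type so Tika detects the real format from the bytes.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, tika_url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), tika_url)
    # Hypothetical timeout; large office files may need more.
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("cancellation.odt")` (a hypothetical file name) would return the document text for indexing, provided the server is up. Building the request separately from sending it keeps the HTTP details testable without a running Tika instance.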