This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Support for Office-Formats with Apache-Tika #600

Draft
wants to merge 3 commits into
base: master

Conversation


@Tooa Tooa commented Jan 9, 2020

Motivation and Description

This pull request adds support for office formats (such as odt, ods, and docx) to paperless. To process these files and extract their content, it uses Apache Tika:

> The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

I started implementing this feature because I have office files stored on an NFS share that are not part of my paperless instance. My goal is to have all my personal documents in one place, accessible via full-text search. This place should be paperless!

This is a work-in-progress PR: I would like to receive initial feedback from the community before I start polishing it for merging. In particular, I would like to know:

  • What do you think about the current way of including Apache-Tika into paperless?
  • Do you have any high-level remarks regarding this first working prototype?
  • Is this feature something people might find useful?
  • Is there a chance of getting this feature merged?
  • Any other comments?

This is a first working prototype - the feature is not complete and still under development (see open tasks).

In the short term, I would like to handle ms-office and open-office formats with this parser. In the long term, however, Apache-Tika might replace even the PDF/OCR parser since Tika also comes with support for tesseract. I haven't decided on the latter yet and the change will definitely not be part of this PR.

Open Questions

I had to make two changes to get the original master working. I wonder why nobody else has problems with the Docker setup of the current master.

1. CORS_ORIGIN_WHITELIST issue

  • I had to add the protocol to the value in CORS_ORIGIN_WHITELIST:
CORS_ORIGIN_WHITELIST = tuple(os.getenv("PAPERLESS_CORS_ALLOWED_HOSTS", "http://localhost:8080").split(","))

Otherwise, I get the following error:

SystemCheckError: System check identified some issues:
consumer_1     |
consumer_1     | ERRORS:
consumer_1     | ?: (corsheaders.E013) Origin 'localhost:8080' in CORS_ORIGIN_WHITELIST is missing  scheme or netloc
consumer_1     | 	HINT: Add a scheme (e.g. https://) or netloc (e.g. example.com).
  • Is this an issue for someone else too?
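
A minimal sketch of an alternative to changing the default: normalizing the configured hosts so that entries without a scheme still pass the corsheaders.E013 check. The helper name `normalize_origins` and the `http` fallback scheme are assumptions for illustration and are not part of this PR.

```python
import os
from urllib.parse import urlsplit


def normalize_origins(raw, default_scheme="http"):
    """Split a comma-separated origin list and prefix entries lacking a scheme.

    Hypothetical helper, not part of paperless or django-cors-headers.
    """
    origins = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty items such as a trailing comma
        parts = urlsplit(entry)
        # An origin needs both a scheme and a netloc (host[:port]).
        if not (parts.scheme and parts.netloc):
            entry = f"{default_scheme}://{entry}"
        origins.append(entry)
    return tuple(origins)


# Same environment variable and default host as in the settings line above.
CORS_ORIGIN_WHITELIST = normalize_origins(
    os.getenv("PAPERLESS_CORS_ALLOWED_HOSTS", "localhost:8080")
)
```

With this approach, existing configurations that list bare hosts keep working, while values that already carry a scheme are left untouched.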

2. Missing Fonts Issue

  • For some reason, I had to add msttcorefonts-installer and fontconfig to the Dockerfile. Otherwise, I get the following error:
Parsers available: TikaDocumentParser
consumer_1     | Consuming /consume/test.odt
consumer_1     | convert: unable to read font `helvetica' @ error/annotate.c/RenderFreetype/1384.
consumer_1     | convert: no images defined `/tmp/paperless/paperless-5dfrf94n/tx.png' @ error/convert.c/ConvertImageCommand/3273.
consumer_1     | PARSE FAILURE for /consume/test.odt: Convert failed at ('convert', '-background none', '-fill', 'black', '-pointsize', '12', '-border 4 -bordercolor none', '-size ', '492x639', ' caption:"', '\n\n\n\n\n\n\n\nThis is a document\n', '" ', '/tmp/paperless/paperless-5dfrf94n/tx.png')
consumer_1     | Parsers available: TikaDocumentParser
consumer_1     | Consuming /consume/test.odt
consumer_1     | convert: unable to read font `helvetica' @ error/annotate.c/RenderFreetype/1384.
consumer_1     | convert: no images defined `/tmp/paperless/paperless-ys5cdlyn/tx.png' @ error/convert.c/ConvertImageCommand/3273.
consumer_1     | PARSE FAILURE for /consume/test.odt: Convert failed at ('convert', '-background none', '-fill', 'black', '-pointsize', '12', '-border 4 -bordercolor none', '-size ', '492x639', ' caption:"', '\n\n\n\n\n\n\n\nThis is a document\n', '" ', '/tmp/paperless/paperless-ys5cdlyn/tx.png')
  • Does anybody else get this error on master? If not, I'm fine with dropping the change, since it would then seem to be a problem specific to my local machine setup.
  • Otherwise, I can move the change into its own commit and combine it with the other RUN command.

Open Tasks

  • Add ms-office file formats to TikaDocumentParser
  • Add Tika to the non-Docker installation (see requirements and setup)
    • I probably have to make the tika URL configurable
  • Add Unit-Tests
  • Add Exception Handling
  • Issue: the UI text field contains a lot of newlines after an odt import
  • Add documentation
  • ...
  • Comply with PEP 8 and additional style guides (see guidelines)
  • Rebase and Squash
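
Two of the open tasks above (a configurable Tika URL and the surplus newlines after an odt import) could be sketched roughly as follows. This assumes a Tika server reachable over HTTP and uses its documented `PUT /tika` endpoint with an `Accept: text/plain` header; the environment variable `PAPERLESS_TIKA_ENDPOINT` and all function names are hypothetical, and the parser in this PR may talk to Tika differently.

```python
import os
import re
import urllib.request

# Port 9998 is the Tika server default; the variable name below is an assumption.
DEFAULT_TIKA_ENDPOINT = "http://localhost:9998"


def tika_endpoint():
    """Return the Tika base URL from the environment, without a trailing slash."""
    return os.getenv("PAPERLESS_TIKA_ENDPOINT", DEFAULT_TIKA_ENDPOINT).rstrip("/")


def build_tika_request(document_bytes, endpoint=None):
    """Build (but do not send) a text-extraction request for the Tika server."""
    url = f"{endpoint or tika_endpoint()}/tika"
    return urllib.request.Request(
        url,
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def collapse_blank_lines(text):
    """Strip trailing whitespace and reduce runs of 3+ newlines to one blank line."""
    text = re.sub(r"[ \t]+\n", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def extract_text(path, timeout=60):
    """Send a file to Tika and return the cleaned plain text (needs a running server)."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read().decode("utf-8", errors="replace")
    return collapse_blank_lines(raw)
```

Keeping request construction separate from the network call makes the URL handling testable without a running Tika instance.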

@Tooa Tooa force-pushed the feat_add_tika_parser branch 3 times, most recently from 06c73d4 to 8544d8f Compare January 11, 2020 12:44
@bauerj
Contributor

bauerj commented Jan 14, 2020

Hey,

I wonder if it would make sense to convert those files to PDF in the consumer.

Having all documents stored as a PDF file has some advantages IMO:

  • Documents are more portable, more systems can display a PDF document than a Word document
  • All fonts etc. are embedded, so your document looks the same even if you use a different system to view it
  • Makes it easier to switch to a different system in X years

What do you think?

@Tooa
Author

Tooa commented Jan 15, 2020

> I wonder if it would make sense to convert those files to PDF in the consumer.

I see your point. Let me describe my use-case in more detail:

Say I have an office document for cancelling an insurance policy. I convert it to PDF, archive it in paperless, and send the PDF via e-mail to the insurance company. The next time I have a similar matter, I want to grab the original office document from paperless, change the details, and send it as a PDF to the next insurance company. So the office documents effectively function as templates. At the moment, I store these templates on a separate NFS share.

The use-case is probably quite specific to me, and the NFS share solution might be sufficient. However, I don't like having different ways of accessing my documents. Maybe it's not worth the effort. I should have asked beforehand.

> All fonts etc. are embedded, so your document looks the same even if you use a different system to view it

Ah! That explains the problems with office documents and the fonts in the container. It was not necessary to provide fonts until now.

@Whisprin

Just commenting on "1. CORS_ORIGIN_WHITELIST issue"
I set PAPERLESS_CORS_ALLOWED_HOSTS="http://localhost:8000" directly in paperless.conf and it works.

@MasterofJOKers
Contributor

While I see your use-case, I don't see Tika being a part of paperless. It's one thing to run a couple of binaries to parse a file, but it's imho another to run a Java-based web server next to a small app like paperless for this purpose. I'd rather see this implemented in another project, as a Django app people can include if they want to, maybe even under the hood of https://github.com/the-paperless-project.

That being said, some of your proposed changes, like supporting more document types in the model, seem to be necessary. Others, like the Tika dependency and the additional Django module for paperless_tika, should imho be moved. I'd be happy to have that project referenced in the README and docs.

That's my 2 cents.

@jonaswinkler
Contributor

jonaswinkler commented Dec 10, 2020

@Tooa Hi there. I've been working on a fork of paperless for a while now. If you're still around and want to make this happen, maybe we can work something out. Contact me if you're interested in contributing and we can discuss the details.
