-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v2 - Support for JBIG2 (port of https://github.com/apache/pdfbox-jbig2) #631
base: master
Are you sure you want to change the base?
Conversation
…and replace System.Drawing.rectangle by Jbig2Rectangle and update NOTICE with JBig2
Thanks @BobLd, I'm thinking of stabilising the current master in order to release 0.1.8 before merging this. Once I've addressed a few remaining issues for the 0.1.8 version I think I'll merge this one directly since GitHub preserves @kasperdaff s authorship in the author field of the source commits? |
@EliotJones yes makes sense, sounds good to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, as with PDF2.0 (historically), the deeply irritating ISO organization yet again paywalls the spec, this seems to be the closest to a freely available JBIG2 spec if you don't have $100s to give to those parasites: https://www.itu.int/rec/T-REC-T.88-201808-I/en
I have reviewed the code, a lot of it is a direct 1:1 port of https://github.com/apache/pdfbox-jbig2/blob/master/LICENSE.txt as mentioned in the original PR. Since I don't have book learning about copyright I'm not sure the position with respect to the NOTICES.txt vs copyright headers being in every file. Since the source repository uses the NOTICES.txt file in the same way we might be fine here, but I'm going to open a StackExchange question to confirm the right approach here (no doubt the question will be summarily closed).
If there was more work to create our implementation here I'd feel more comfortable but it is such a huge amount of code for such a rare/pointless image format (edit: actually since the common usage was document scans potentially not so pointless...) so I'm conflicted. It's also a huge code-debt to take on.
@BobLd I'd like to understand what is needed for continuing work on your image rendering branch. How much of this is blocking for further work on image rendering. I have some associated JPG related code here https://github.com/EliotJones/BigGustave/tree/add-jpg-support/src/BigGustave/Jpgs for things like Huffman trees, not sure if we could use that instead to unblock DCT decode? (Assuming we just skip JBIG2 images for now)
While I'm not opposed to getting this in I think I'd need to work on it in this branch to get it to a state I'm happier with from a code-ownership perspective. Especially since JBIG parsing comes with some (recent) vulnerabilities https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html
Let me know what you think
(please take my slightly annoyed tone as residual annoyance at ISO, they ruin my day every time, this work is great and thank you both for your efforts here)
thanks a lot @EliotJones for the response.
I've not reviewed the whole logic of the initial commit and just saw that it was closely following the java version so was confident it wored. Test also looked quite complete. Thanks for sending over the link, I'll have a look. I've only looking at how DCT decoding would work and saw Huffman tables involved, I thought it would be useful to not re-invent the weel
I agree and would be very curious about the StackExchange answer 😄
Makes a lot of sense too, I don't think there's a real rush to merge that. I think it's important to be confident with what's merged into the code
Image rendering is not a blocking point so far, I'll create an issue soon to explain how I view the next steps for image rendering and possible solutions. I will definitely check your repos as it will certainly be useful for DCT
As above, important to be confident with what's merged so no rush. And we can even find another way to process JBIG images down the line. I was very amazed to read these zero click vulnerabilities are linked to JBIG, fascinating read - thanks for that!
I'm as annoyed with these ISO paywal haha, no problem! Seems JPX2000 is also behind paywall by the way... |
@BobLd from this answer I think we're ok with the Notices attribution only and the license matching the source project. https://opensource.stackexchange.com/questions/14050/notices-txt-or-copyright-file-header-when-porting-code Interestingly another JBIG post made the rounds today https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning though less to do with JBIG being tricky and more the wrong usage of it. |
@EliotJones thanks for taking care of getting an answer for the NOTICE question, good to know it's straightforward. Regarding JBIG, I've read somewhere that one should never use it to compress images with text (even if it seems it was originally created for that) exactly for the issue discussed in the article... crazy |
Hello, I've been following the progress of the PR for integrating JBIG2 support into PdfPig, and I appreciate the effort to port the Java pdfbox-jbig2 project for this purpose. It's an exciting development for the .NET community working with PDFs and JBIG2. I'd like to suggest a slightly different approach that might bring additional benefits both to the PdfPig project and the wider .NET community. Given the specialized nature of JBIG2 support, it could be beneficial to develop this feature first as a standalone package or library, targeting .NET Standard/Core. Advantages of creating a separate Package/Library:
Creating a standalone library for JBIG2 aligns with the principles of open-source development and can stimulate more community involvement. It also ensures independent evolution of this feature, benefiting a broader audience within the .NET community while maintaining the focus and integrity of the PdfPig project. |
I ran into JBIG2 related issues today and ended up here. |
I'm not opposed to a separate library for this. Though I generally aim to keep PdfPig dependency free (outside of Microsoft/.NET dependencies) this is somewhere where code re-use makes sense and the cost of ownership for PdfPig is high for implementation. |
I kind of like the approach to use external existing libraries to handle missing filters and color space (ICC color) as these represent a non-negligible amount of work. Regarding using external libraries (with their own licenses), I think we can start with the following:
Now the question is how to actually make it possible to use these NuGet packages from the code, i.e. adding these filters/color space (coming from different NuGet packages) in external code. I'm not sure what is the best approach to take. @EliotJones let me know what you think |
What about just exposing opaque bytes for non-supported images, and allow the user to take a dependency on their own decoder of choice? |
@iamcarbon not sure what you mean here, I might be missing sthing though Pdfpig already exposes the bytes, even without decoding. Also keep in mind that once these bytes are decoded, pdfpig applies the color space. If you want to do a proof of concept of your idea, I'm happy to look into it |
By reading the code I just realized the solution is almost already in our hands... A third solution could be to implement a "plug-in architecture" within PdfPig. This would allow JBIG2 support to be added as an optional "plug-in" rather than a built-in feature. And this plug in architecture almost already exist thanks to existing interfaces. This approach combines the benefits of modularity and ease of integration. Users who need JBIG2 support can add the plug-in, while those who don't need it can keep their deployments lightweight. This also allows for independent updates / libraries and maintenance of the JBIG2 component (and possibly other filters). Explanation:
Why not just adding the possibility to add custom filters to With that, anyone could be able to add extended filter or their own version of an existing filter, including JBIG2. What do you think ? @BobLd @EliotJones |
@sbruyere makes a lot of sense to me, I just it would make more sense to do that at filter provider level, I.e. void AddFilterProvider(IFilterProvider filterProvider); |
You mean having more than one filter provider for processing or being able to replace / set the current filter provider ?
In my opinion, the simplest way is to have the ability to add custom filters to existing (and / or custom) filter providers. This way it will be easy to add or replace a filter to the default filter provider which already define all others default filters like |
@sbruyere I think I misunderstood your first comment. Happy for you to do a proof of concept PR (no need to implement the actual filters) to demonstrate your approach, we can then use that as a base for discussion |
It's on my plan indeed, I'll possibly submit a PR before end of month. Thanks for the feedback. |
@sbruyere do you still want to do a PoC PR? I might start working on that and ask for your opinion. Let me know |
@sbruyere I've updated PdfPig to allow for external filters. I've implemented the DCT filter in a different package https://github.com/BobLd/UglyToad.PdfPig.Filters.Dct.JpegLibrary - still early stage. @mvantzet I'm planning to release NuGet packages for JBIG2 and JPX shortly Any feedback welcome |
@EliotJones, I took @kasperdaff commit and added the changes from your comments in a 2nd commit.
I've also added the JBig2 notice inside the current NOTICE file.
As a side note, it seems some part of the code will be useful for JPX and DTC decode.
Also, to keep it clean, I'd would really prefer the original PR from @kasperdaff being accepted and merge instead of merging this PR (some conflicts will need to be resolved though). If the original PR is merge, we can just then cherry peek my changes and merge them.