Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2 - Support for JBIG2 (port of https://github.com/apache/pdfbox-jbig2) #631

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

BobLd
Copy link
Collaborator

@BobLd BobLd commented May 26, 2023

@EliotJones, I took @kasperdaff commit and added the changes from your comments in a 2nd commit.

I've also added the JBig2 notice inside the current NOTICE file.

As a side note, it seems some part of the code will be useful for JPX and DTC decode.

Also, to keep it clean, I'd would really prefer the original PR from @kasperdaff being accepted and merge instead of merging this PR (some conflicts will need to be resolved though). If the original PR is merge, we can just then cherry peek my changes and merge them.

kasperdaff and others added 2 commits May 26, 2023 14:02
…and replace System.Drawing.rectangle by Jbig2Rectangle and update NOTICE with JBig2
@BobLd BobLd requested a review from EliotJones May 26, 2023 13:37
@EliotJones
Copy link
Member

Thanks @BobLd, I'm thinking of stabilising the current master in order to release 0.1.8 before merging this. Once I've addressed a few remaining issues for the 0.1.8 version I think I'll merge this one directly since GitHub preserves @kasperdaff s authorship in the author field of the source commits?

@BobLd
Copy link
Collaborator Author

BobLd commented Jun 4, 2023

@EliotJones yes makes sense, sounds good to me

Copy link
Member

@EliotJones EliotJones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, as with PDF2.0 (historically), the deeply irritating ISO organization yet again paywalls the spec, this seems to be the closest to a freely available JBIG2 spec if you don't have $100s to give to those parasites: https://www.itu.int/rec/T-REC-T.88-201808-I/en

I have reviewed the code, a lot of it is a direct 1:1 port of https://github.com/apache/pdfbox-jbig2/blob/master/LICENSE.txt as mentioned in the original PR. Since I don't have book learning about copyright I'm not sure the position with respect to the NOTICES.txt vs copyright headers being in every file. Since the source repository uses the NOTICES.txt file in the same way we might be fine here, but I'm going to open a StackExchange question to confirm the right approach here (no doubt the question will be summarily closed).

If there was more work to create our implementation here I'd feel more comfortable but it is such a huge amount of code for such a rare/pointless image format (edit: actually since the common usage was document scans potentially not so pointless...) so I'm conflicted. It's also a huge code-debt to take on.

@BobLd I'd like to understand what is needed for continuing work on your image rendering branch. How much of this is blocking for further work on image rendering. I have some associated JPG related code here https://github.com/EliotJones/BigGustave/tree/add-jpg-support/src/BigGustave/Jpgs for things like Huffman trees, not sure if we could use that instead to unblock DCT decode? (Assuming we just skip JBIG2 images for now)

While I'm not opposed to getting this in I think I'd need to work on it in this branch to get it to a state I'm happier with from a code-ownership perspective. Especially since JBIG parsing comes with some (recent) vulnerabilities https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html

Let me know what you think

(please take my slightly annoyed tone as residual annoyance at ISO, they ruin my day every time, this work is great and thank you both for your efforts here)

@BobLd
Copy link
Collaborator Author

BobLd commented Jun 7, 2023

thanks a lot @EliotJones for the response.

Unfortunately, as with PDF2.0 (historically), the deeply irritating ISO organization yet again paywalls the spec, this seems to be the closest to a freely available JBIG2 spec if you don't have $100s to give to those parasites: https://www.itu.int/rec/T-REC-T.88-201808-I/en

I've not reviewed the whole logic of the initial commit and just saw that it was closely following the java version so was confident it wored. Test also looked quite complete. Thanks for sending over the link, I'll have a look. I've only looking at how DCT decoding would work and saw Huffman tables involved, I thought it would be useful to not re-invent the weel

I have reviewed the code, a lot of it is a direct 1:1 port of https://github.com/apache/pdfbox-jbig2/blob/master/LICENSE.txt as mentioned in the original PR. Since I don't have book learning about copyright I'm not sure the position with respect to the NOTICES.txt vs copyright headers being in every file. Since the source repository uses the NOTICES.txt file in the same way we might be fine here, but I'm going to open a StackExchange question to confirm the right approach here (no doubt the question will be summarily closed).

I agree and would be very curious about the StackExchange answer 😄

If there was more work to create our implementation here I'd feel more comfortable but it is such a huge amount of code for such a rare/pointless image format (edit: actually since the common usage was document scans potentially not so pointless...) so I'm conflicted. It's also a huge code-debt to take on.

Makes a lot of sense too, I don't think there's a real rush to merge that. I think it's important to be confident with what's merged into the code

@BobLd I'd like to understand what is needed for continuing work on your image rendering branch. How much of this is blocking for further work on image rendering. I have some associated JPG related code here https://github.com/EliotJones/BigGustave/tree/add-jpg-support/src/BigGustave/Jpgs for things like Huffman trees, not sure if we could use that instead to unblock DCT decode? (Assuming we just skip JBIG2 images for now)

Image rendering is not a blocking point so far, I'll create an issue soon to explain how I view the next steps for image rendering and possible solutions. I will definitely check your repos as it will certainly be useful for DCT

While I'm not opposed to getting this in I think I'd need to work on it in this branch to get it to a state I'm happier with from a code-ownership perspective. Especially since JBIG parsing comes with some (recent) vulnerabilities https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html

As above, important to be confident with what's merged so no rush. And we can even find another way to process JBIG images down the line. I was very amazed to read these zero click vulnerabilities are linked to JBIG, fascinating read - thanks for that!

(please take my slightly annoyed tone as residual annoyance at ISO, they ruin my day every time, this work is great and thank you both for your efforts here)

I'm as annoyed with these ISO paywal haha, no problem! Seems JPX2000 is also behind paywall by the way...

@EliotJones
Copy link
Member

@BobLd from this answer I think we're ok with the Notices attribution only and the license matching the source project. https://opensource.stackexchange.com/questions/14050/notices-txt-or-copyright-file-header-when-porting-code

Interestingly another JBIG post made the rounds today https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning though less to do with JBIG being tricky and more the wrong usage of it.

@BobLd
Copy link
Collaborator Author

BobLd commented Jun 17, 2023

@EliotJones thanks for taking care of getting an answer for the NOTICE question, good to know it's straightforward.

Regarding JBIG, I've read somewhere that one should never use it to compress images with text (even if it seems it was originally created for that) exactly for the issue discussed in the article... crazy

@sbruyere
Copy link
Contributor

sbruyere commented Nov 13, 2023

Hello,

I've been following the progress of the PR for integrating JBIG2 support into PdfPig, and I appreciate the effort to port the Java pdfbox-jbig2 project for this purpose. It's an exciting development for the .NET community working with PDFs and JBIG2.

I'd like to suggest a slightly different approach that might bring additional benefits both to the PdfPig project and the wider .NET community. Given the specialized nature of JBIG2 support, it could be beneficial to develop this feature first as a standalone package or library, targeting .NET Standard/Core.

Advantages of creating a separate Package/Library:

  • Wider Applicability: As a separate library, JBIG2 support could be used by other projects beyond PdfPig, fostering broader adoption and contribution across the .NET community.
  • Focused Development and Testing: An independent library allows for dedicated development and testing of JBIG2 features, improving the overall quality and performance.
  • Ease of Distribution via NuGet: Distributing this as a standalone package on NuGet would make it more accessible and encourage its use within the .NET ecosystem.
  • License Compliance Checks: Developing this feature separately initially allows for a thorough review of licensing, ensuring compliance and mitigating potential legal issues for PdfPig.
  • Future Integration Possibilities: Once the library is stable and its licensing is clear, it could be seamlessly integrated into PdfPig as a NuGet dependency. This ensures the addition of a well-supported, community-approved feature to PdfPig.

Creating a standalone library for JBIG2 aligns with the principles of open-source development and can stimulate more community involvement. It also ensures independent evolution of this feature, benefiting a broader audience within the .NET community while maintaining the focus and integrity of the PdfPig project.

@mvantzet
Copy link
Contributor

I ran into JBIG2 related issues today and ended up here.
@EliotJones / @BobLd: Any thoughts about the suggestion from @sbruyere?

@EliotJones
Copy link
Member

I'm not opposed to a separate library for this. Though I generally aim to keep PdfPig dependency free (outside of Microsoft/.NET dependencies) this is somewhere where code re-use makes sense and the cost of ownership for PdfPig is high for implementation.

@BobLd
Copy link
Collaborator Author

BobLd commented Jan 11, 2024

I kind of like the approach to use external existing libraries to handle missing filters and color space (ICC color) as these represent a non-negligible amount of work.

Regarding using external libraries (with their own licenses), I think we can start with the following:

Now the question is how to actually make it possible to use these NuGet packages from the code, i.e. adding these filters/color space (coming from different NuGet packages) in external code. I'm not sure what is the best approach to take.

@EliotJones let me know what you think

@iamcarbon
Copy link
Collaborator

What about just exposing opaque bytes for non-supported images, and allow the user to take a dependency on their own decoder of choice?

@BobLd
Copy link
Collaborator Author

BobLd commented Mar 14, 2024

@iamcarbon not sure what you mean here, I might be missing sthing though

Pdfpig already exposes the bytes, even without decoding. Also keep in mind that once these bytes are decoded, pdfpig applies the color space.

If you want to do a proof of concept of your idea, I'm happy to look into it

@sbruyere
Copy link
Contributor

sbruyere commented May 27, 2024

By reading the code I just realized the solution is almost already in our hands...

A third solution could be to implement a "plug-in architecture" within PdfPig. This would allow JBIG2 support to be added as an optional "plug-in" rather than a built-in feature. And this plug in architecture almost already exist thanks to existing interfaces.

This approach combines the benefits of modularity and ease of integration. Users who need JBIG2 support can add the plug-in, while those who don't need it can keep their deployments lightweight. This also allows for independent updates / libraries and maintenance of the JBIG2 component (and possibly other filters).

Explanation:

  • Filter implements the public interface IFilter ;
  • Filter are managed by a filter provider (a class that implements the public interface IFilterProvider), the default one is DefaultFilterProvider.

Why not just adding the possibility to add custom filters to IFilterProvider with a new void AddFilter(IFilter filter); method ?

With that, anyone could be able to add extended filter or their own version of an existing filter, including JBIG2.

What do you think ? @BobLd @EliotJones

@BobLd
Copy link
Collaborator Author

BobLd commented May 28, 2024

@sbruyere makes a lot of sense to me, I just it would make more sense to do that at filter provider level, I.e.

void AddFilterProvider(IFilterProvider filterProvider);

@sbruyere
Copy link
Contributor

@sbruyere makes a lot of sense to me, I just it would make more sense to do that at filter provider level, I.e.

void AddFilterProvider(IFilterProvider filterProvider);

You mean having more than one filter provider for processing or being able to replace / set the current filter provider ?

  • If you have multiple filter provider, this could lead to having more than one filter for the same kind of data, not sure it's the expected behavior, and it will make far more complexe from which filter provider you have to get the output data.
  • If you means to set / replace the current filter provider, there is another issue: default common filters that are defined in DefaultFilterProvider are set "internals", which mean we canno't use them in a custom defined Filter Provider:
    internal class Ascii85Filter : IFilter; however it could be interesting to change their visibilities to public for some reasons.

In my opinion, the simplest way is to have the ability to add custom filters to existing (and / or custom) filter providers. This way it will be easy to add or replace a filter to the default filter provider which already define all others default filters like Ascii85Filter.

@BobLd
Copy link
Collaborator Author

BobLd commented Jun 3, 2024

@sbruyere I think I misunderstood your first comment. Happy for you to do a proof of concept PR (no need to implement the actual filters) to demonstrate your approach, we can then use that as a base for discussion

@sbruyere
Copy link
Contributor

sbruyere commented Jun 7, 2024

@sbruyere I think I misunderstood your first comment. Happy for you to do a proof of concept PR (no need to implement the actual filters) to demonstrate your approach, we can then use that as a base for discussion

It's on my plan indeed, I'll possibly submit a PR before end of month. Thanks for the feedback.

@BobLd
Copy link
Collaborator Author

BobLd commented Jul 21, 2024

@sbruyere do you still want to do a PoC PR? I might start working on that and ask for your opinion. Let me know

@BobLd
Copy link
Collaborator Author

BobLd commented Oct 18, 2024

@sbruyere I've updated PdfPig to allow for external filters. I've implemented the DCT filter in a different package https://github.com/BobLd/UglyToad.PdfPig.Filters.Dct.JpegLibrary - still early stage.

@mvantzet I'm planning to release NuGet packages for JBIG2 and JPX shortly

Any feedback welcome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants