Proposal: Enhanced purls for generic artifacts #692

Open
wants to merge 1 commit into
base: main
from

Conversation

brunoapimentel
Contributor

@brunoapimentel brunoapimentel commented Oct 21, 2024

This draft introduces a more generic alternative to solve the need for specific Maven purls covered in #663.

In summary, it is a mechanism that would allow us to support a subset of purl types that do not fit into existing package manager ecosystems. Cachi2 would receive purls as input, resolve them into a download URL, and report them in the SBOM.

Maintainers will complete the following section

  • Commit messages are descriptive enough
  • Code coverage from testing does not decrease and new code is covered
  • Docs updated (if applicable)
  • Docs links in the code are still valid (if docs were updated)

docs/design/generic-enhanced-purls.md Outdated Show resolved Hide resolved
### Cachi2 CLI usage

```
cachi2 fetch-deps --source /path/to/repo generic
```

Collaborator

Would we want to have an ability to add arbitrary configs to the mix?

cachi2 fetch-deps --source /path/to/repo generic --add-to-the-mix ../generic-artifacts.yaml --detect-components-with ../cachi2-config.yaml

I still cannot wrap my head around cachi2-specific configs in (potentially) unrelated repos.

Contributor

I think we'll just use/enhance the existing config-file functionality, right?

Contributor Author

I think Alexey is referring to the lockfile itself. I've yet to see users complaining about adding the lockfiles, but --add-to-the-mix is definitely a possibility.

The main concern I have against it is that, by having everything committed to the repo, we can improve the reproducibility of the requests.

Collaborator

Yes, we can default to a location within a repo and use the default configuration mechanism, but we would need a way to point to some external config which will be version-controlled elsewhere. Suppose someone needs to build ScaPy (or some other kilostarred package) and suppose they also need some prebuilt generic artifact. I don't think it is safe to assume that upstream ScaPy would just accept any cachi2 yaml. If it is a local clone, then someone would still need to add the config and maintain it. Likely not a blocker right now, but I feel like we'll need to deal with this at some point.

Member

I agree with some of @a-ovchinnikov's reasoning when it comes to checking in some random file, but then again, imagine a blackbox pipeline: where would such a file come from? It might easily become a devops engineer's nightmare. I also agree with @brunoapimentel that from a reproducibility POV and for ease of debugging you want users to commit the file to their repos. Now, why would any random big upstream care about cachi2? Exactly, they won't; that's why downstream exists, and then such a file is perfectly fine to be checked in to a repo. However, downstream isn't very popular in the K8S world, so users tend to solve it with git submodules instead to compensate for the need of downstream control, and here again, checking in the cachi2 lockfile is perfectly fine. I don't even think we need a different way of providing it ATM. If the need arises, we can discuss again.

docs/design/generic-enhanced-purls.md Outdated Show resolved Hide resolved

## Enhanced purls overview

- Implement an `enhancer` for every supported purl type, which is essentially a set of rules that will be applied to a generic artifact and, in case those rules can be matched, it will replace the generic purl for a more specific type.

Member

Well, in that case we have to be prepared for the burden of pretty much being in sync with most of the PURL spec and keeping up with different pure data PURL types. I'm not a fan per se because we'll essentially help co-maintain the PURL spec from the shadows (or at least be one of the loudest complaining stakeholders). IOW the PURL spec currently doesn't keep up with itself, and many of the current user RFEs on spec enhancement stem from the fact that many PURL types are defined very vaguely, giving too much room for interpretation to the implementations. The biggest problem with that is that there ARE NO releases, so if the spec needs to fix something, it'll break our implementation. Additionally, there currently isn't ONE true implementation of the spec, which complicates the lives of the few spec readers out there, and now we'll become one of them.
Note it's not that much about the resulting PURLs themselves, but the qualifiers we'll need to keep up with.

You can also count on requests asking for adding support for their own super-duper qualifier (yep, syft) which doesn't exist in the spec, we'll have to be very careful about that and not accept anything that isn't in the spec! (then again, RPM PURLs suck so much that we've already done that, so there goes...)

Since the main motivation here is Maven the other alternative would truly be a different lockfile, but the same backend code for the fetcher. While that may have started to make sense to me in retrospect, I don't think we'd end there and I can already see requests for those various data-typed PURLs describing the artifacts which means we'd have to do what you propose at some point anyway.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a more complicated approach to me than what is proposed in #663

Besides dealing with complexity of recognizing component types and creating proper PURLs for them, this proposal is based on the assumption that users will actually want Cachi2 to do that.
If a user provides a PURL to fetch, how do you know that the user will prefer Cachi2 to change it to something else?
How can a user be in control of the recorded PURLs?
What if Cachi2 can't get it right (for some reason) but the user knows what it should be, how could a user provide the desired target PURL?

IMO, whatever a user provides should not be altered.

Contributor Author

I think this is the core of the discussion that led us to the change. We, as Cachi2 maintainers, are not confident we should allow users to specify what the reported purls should be. And to clarify, this design proposes that only resolved download_urls are used as input, which means we wouldn't accept purls as input.

If we consider Cachi2 as a standalone project, its goal is to prefetch the dependencies necessary for a hermetic build, and report them the most accurate way possible, but only to the extent of the info it has at hand during the prefetch. Sometimes, the user will have a better purl than what Cachi2 can provide. That is not true in all cases, though. Allowing purls to be passed through creates a precedent that goes against our ultimate goal of accurate SBOMs.

Now, if we need better purls for our specific internal case, maybe they can be enhanced in other points of the pipeline? But we can lead this discussion somewhere else.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> this design proposes that only resolved download_urls are used as input

Thanks, I overlooked that.

> If we consider Cachi2 as a standalone project, its goal is to prefetch the dependencies necessary for a hermetic build, and report them the most accurate way possible, but only to the extent of the info it has at hand during the prefetch.

Agreed.

> Allowing purls to be passed through creates a precedent that goes against our ultimate goal of accurate SBOMs.

Could you please elaborate on that?

Whatever Cachi2 produces will be produced (ultimately) for the user who provides the input. The user will be responsible for the quality of the input and, as a consequence, the output.

Contributor Author

> Whatever Cachi2 produces will be produced (ultimately) for the user who provides the input. The user will be responsible for the quality of the input and, as a consequence, the output.

I agree the user has substantial responsibility towards the quality of the input/output, but Cachi2 has traditionally been on the restrictive side of things. For instance, for pip, Cachi2 simply won't process repos that don't have a fully resolved requirements file (with pinned versions and hashes). Cachi2 will never execute a setup.py file because this allows the execution of arbitrary code, which undermines our certainty that the only downloaded files were exactly the ones that are described in a certain ref of a repo.

Having Cachi2 receive a purl as input, resolve it to download an artifact, and simply copy-paste the purl to the output SBOM seems to open a wide scope of things that could happen. From the accuracy perspective, it could allow anything from simple typos to non-compliant purls to be passed through. From a practical standpoint, we would still need to validate the purls to an extent, and to keep up with the purl-spec.

Looking from another perspective, what we're proposing here already allows the user to specify the purl (limited to a subset of purl types and with constraints on the details of the output). The decision here is mostly to not have purls as input, and to limit what can be done as output. So my question is, would the output purl proposed for maven be outside of what you expect? Are there any corner cases of concern?

Member

> Sometimes, the user will have a better purl than what Cachi2 can provide

That's the sad reality of the PURL spec. In an ideal (or even just a normal) world, the spec (like any other spec) would itself define what a perfect PURL looks like for each artifact type, and with cachi2 strictly following the spec this statement would hold no ground; the PURL spec doesn't live in that world :( .

> Whatever Cachi2 produces will be produced (ultimately) for the user who provides the input. The user will be responsible for the quality of the input and, as a consequence, the output.

@aloubyansky this is AFAIK called tainting in our field, i.e. consuming user input with little to no validation and use it for output generation (in this case taking the PURL verbatim) and as such tainting is undesirable in general and so we'd like to avoid it by forcing users to provide the information in form of attributes leading to the desired PURL on the output. This is especially problematic with the PURL spec, which, as I mentioned earlier, is quite vague for some artifact types, so we'd have to trust the user's judgement on the quality of an input PURL.

> I'll have ~6500 Maven artifacts in a lockfile. It'd be a lot easier to simply express essentially the same info in a single PURL.
>
> Now I am not sure how different this would be for other component types.

@aloubyansky I guess you kinda answered your concerns with ^this - cachi2 cannot tailor a solution to a single use case knowing not all consumers would do their due diligence when it comes to providing input using PURLs straightaway.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't see yet how this proposal is better from the "undesirable tainting" perspective.
Could we have an example side-by-side of a final PURL vs a generic one + extra info that would show the advantage of the latter?

Cachi2 is already hooked on PURLs either way.

Contributor

> i.e. consuming user input with little to no validation and use it for output generation (in this case taking the PURL verbatim) and as such tainting is undesirable in general and so we'd like to avoid it by forcing users to provide the information in form of attributes leading to the desired PURL on the output.

I don't think this is true. Cachi2 is quite literally in control of how much validation it does. It does not have to take the purl verbatim, it is free to decompose it into individual attributes and validate those.

> I guess you kinda answered your concerns with ^this - cachi2 cannot tailor a solution to a single use case knowing not all consumers would do their due diligence when it comes to providing input using PURLs straightaway.

You're thinking about other use cases and that is good. However, those are so far all theoretical, while this is a very real one. I can do some research and give you the data, if other purl types convert to url in similar way, if that's the issue.

Other than that, in the case of the custom lockfile for the generic fetcher, you suggested that we use SBOMs instead, which would likely involve letting users supply the purl as part of those SBOM components. I can't see why now, when there's decision between an established format for specifying packages (purl) and a custom implementation that is essentially url + some custom attributes, your reasoning is different.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> I can do some research and give you the data, if other purl types convert to url in similar way, if that's the issue.

I think it'd be worth it

Member

@eskultety eskultety Oct 23, 2024

> I don't think this is true. Cachi2 is quite literally in control of how much validation it does. It does not have to take the purl verbatim, it is free to decompose it into individual attributes and validate those.

Does such decomposition involve detection of all extra non-spec qualifiers that we could flag and forbid? Because with YAML attributes this is as easy as validating the lockfile model and failing early and loudly about unrecognized attributes. Besides, I have a feeling that accepting a PURL is inviting the user to expect that, well, since we already consumed the PURL we just use it, right? Not really. Admittedly, either way this is an implementation detail.

> Other than that, in the case of the custom lockfile for the generic fetcher, you suggested that we use SBOMs instead, which would likely involve letting users supply the purl as part of those SBOM components. I can't see why now, when there's decision between an established format for specifying packages (purl) and a custom implementation that is essentially url + some custom attributes, your reasoning is different.

@kosciCZ I don't think it is different at all, actually. The proposal of input SBOMs was shut down as too complex and we moved on to a different approach. I don't think I ever advocated for input PURLs anywhere (going back to #652 and more specifically this thread: #652 (comment)), for me it was the future proof type of input when it comes to the plethora of standardized attributes we could consume and look at during artifact processing from an input SBOM, PURLs are just an inherent part of SBOMs which we'd get as a side effect we could not do anything about at all if we had decided to consume it, so I think we may have misunderstood each other in terms of my stance towards PURLs in general. However, this is a new proposal where we have the choice for all input data type/format.

> I can do some research and give you the data, if other purl types convert to url in similar way, if that's the issue.

I'm not sure I follow the outcome here, I guess you mean if other PURLs have a direct download URL mandatory? Is that what you're saying you'd like to check?
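
To illustrate the earlier point about failing early and loudly on unrecognized lockfile attributes, here is a minimal stdlib-only sketch. It is not Cachi2's actual lockfile model: `download_url` and `checksums` are the only attribute names taken from this thread, while `Artifact`, `parse_artifact` and the key sets are hypothetical.

```python
from dataclasses import dataclass

# Assumed schema for one lockfile entry; only these two attribute names
# appear in the discussion, everything else here is hypothetical.
ALLOWED_KEYS = {"download_url", "checksums"}
REQUIRED_KEYS = {"download_url", "checksums"}


@dataclass(frozen=True)
class Artifact:
    download_url: str
    checksums: dict


def parse_artifact(entry: dict) -> Artifact:
    """Validate one already-parsed YAML entry, rejecting unknown attributes."""
    unknown = set(entry) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unrecognized attributes: {sorted(unknown)}")
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        raise ValueError(f"missing attributes: {sorted(missing)}")
    if not isinstance(entry["checksums"], dict) or not entry["checksums"]:
        raise ValueError("checksums must be a non-empty mapping")
    return Artifact(entry["download_url"], dict(entry["checksums"]))
```

With attributes, an unexpected key such as a stray `purl` is rejected before anything is fetched; the same check against a purl string would first require decomposing it into qualifiers.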

@brunoapimentel brunoapimentel force-pushed the purl-enhancer-proposal branch 3 times, most recently from 3a1c1c3 to d1255d5 Compare October 30, 2024 20:58
@brunoapimentel brunoapimentel changed the title Draft proposal: Enhanced purls for generic artifacts Proposal: Enhanced purls for generic artifacts Oct 30, 2024
@brunoapimentel
Contributor Author

New pushes: rewrite of the design to consider consuming purls as input, and what the resolution would look like for specific purl types.
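
As a rough illustration of what resolving a purl into a download URL could look like for Maven, here is a minimal sketch. It assumes the standard Maven repository layout (`group/path/artifact/version/artifact-version[-classifier].type`), Maven Central as the default repository, and `jar` as the default type; the function name and the hand-rolled parsing are hypothetical, and a real implementation would more likely parse with packageurl-python.

```python
from urllib.parse import parse_qsl, unquote

# Assumed default repository when no repository_url qualifier is given.
DEFAULT_MAVEN_REPO = "https://repo.maven.apache.org/maven2"


def maven_purl_to_url(purl: str) -> str:
    """Resolve a pkg:maven purl to a download URL (simplified parsing)."""
    prefix = "pkg:maven/"
    if not purl.startswith(prefix):
        raise ValueError(f"not a maven purl: {purl!r}")
    rest, _, _subpath = purl[len(prefix):].partition("#")
    path, _, query = rest.partition("?")
    coords, _, version = path.partition("@")
    group, _, artifact = coords.partition("/")
    if not (group and artifact and version):
        raise ValueError(f"purl needs namespace, name and version: {purl!r}")
    group, artifact, version = unquote(group), unquote(artifact), unquote(version)

    qualifiers = dict(parse_qsl(query))
    repo = qualifiers.get("repository_url", DEFAULT_MAVEN_REPO).rstrip("/")
    if "://" not in repo:
        # Assumption: a scheme-less repository_url is served over https.
        repo = f"https://{repo}"

    filename = f"{artifact}-{version}"
    if qualifiers.get("classifier"):
        filename += f"-{qualifiers['classifier']}"
    filename += f".{qualifiers.get('type', 'jar')}"
    return f"{repo}/{group.replace('.', '/')}/{artifact}/{version}/{filename}"
```

For example, `pkg:maven/org.apache.commons/commons-lang3@3.12.0` would map to a path ending in `org/apache/commons/commons-lang3/3.12.0/commons-lang3-3.12.0.jar` under the default repository.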

Contributor

@arewm arewm left a comment

Do we have a record of what confidence Cachi2 has for the reported contents for the various supported package managers outside of this proposal?

Comment on lines 14 to 13
- Parse the purl by using the [packageurl-python](https://github.com/package-url/packageurl-python) library
- Validate that the purl is within the supported types

Contributor

Earlier you mentioned that purls are reported as generic. Would Cachi2 also support a generic purl as input? If the supported types mentioned below are not used, would Cachi2 try to just fall back to fetch a generic artifact?

Contributor Author

I think we should keep the current lockfile option of providing download_url and checksums for the pkg:generic cases, but we can also support consuming purls as input for the sake of consistency.

As for the types, I think we need to be very explicit about which ones are supported, and anything that falls outside of that means a failed request.
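
A minimal sketch of such an explicit allowlist, where anything outside it fails the request. The supported set, the exception name and the simplified type extraction are all hypothetical; the design itself proposes parsing with packageurl-python rather than by hand.

```python
# Hypothetical initial allowlist; the real set is a design decision.
SUPPORTED_PURL_TYPES = frozenset({"maven"})


class UnsupportedPurlTypeError(ValueError):
    """Raised when an input purl uses a type outside the supported set."""


def purl_type(purl: str) -> str:
    """Extract the lowercased type from a purl string (simplified parsing)."""
    scheme, sep, remainder = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    # Leading slashes after the scheme are not significant.
    type_, sep, _ = remainder.lstrip("/").partition("/")
    if not sep or not type_:
        raise ValueError(f"purl has no type: {purl!r}")
    return type_.lower()


def ensure_supported(purl: str) -> str:
    """Return the purl type, or fail the request if it is not supported."""
    found = purl_type(purl)
    if found not in SUPPORTED_PURL_TYPES:
        raise UnsupportedPurlTypeError(
            f"purl type {found!r} is not supported; "
            f"supported types: {sorted(SUPPORTED_PURL_TYPES)}"
        )
    return found
```

Under this sketch a `pkg:generic` input purl would be rejected as well, which matches the current state of the proposal discussed in this thread.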

Contributor

So the only way that users would be able to prefetch in a way that produces a generic purl would be through the current lockfile option? There is no proposal to add a purl-supported generic package fetching?

Contributor Author

There isn't, but we can add one. Do you think it's worth doing it now?

Contributor

To be clear, will this proposal generate purls with a pkg:generic type? Or since the generic type isn't supported, we will never be able to produce those purls?

We can always add it later if you don't think that there is a specific need for it with this proposal.

Contributor Author

No, I did not write anything about consuming input pkg:generic purls here because this is already supported by the current lockfile. So, initially, Cachi2 would not consume pkg:generic purls and, for this reason, users wouldn't be able to produce pkg:generic SBOM components by providing input purls.

The mechanism to extend support to any purl type is already described here, though, so supporting input pkg:generic is trivial (and we probably should, for the sake of consistency).

Contributor

ack. completeness for this might be beneficial, but it isn't relevant to the acceptance/rejection of the proposal.

If you support generic with this method, the question might be asked in the future if you should deprecate one of the two generic package fetching strategies. It is probably fine to leave the trivial implementation until it is requested specifically?

@@ -0,0 +1,233 @@
# Support for different purl types in the generic artifact fetcher

The generic artifact package manager is being added to Cachi2 as a means for users to introduce files that do not belong to traditional package manager ecosystems (e.g. pip, npm, golang) to their hermetic container builds. Since Cachi2 does not have any extra information about the file that's being fetched, the purls are always reported as [pkg:generic](https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#generic).

Contributor

Is this a generic purl type and/or a generic property, i.e.

    "properties": [
        {
            "name": "cachi2:found_by",
            "value": "cachi2:generic"
        }
    ]

Contributor Author

In this case, a generic purl type. The property is a way to inform which backend/package manager in Cachi2 produced that SBOM component.
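
To make the distinction concrete, here is a hedged sketch of a component that carries both: a `pkg:generic` purl and the `cachi2:found_by` property. The dict shape loosely follows CycloneDX, but the `type` value, deriving the name from the URL, and the exact qualifier encoding are assumptions, not Cachi2's actual output.

```python
from pathlib import PurePosixPath
from urllib.parse import quote, urlsplit


def generic_component(download_url: str, algorithm: str, digest: str) -> dict:
    """Build an SBOM-style component for a generic artifact (illustrative only)."""
    name = PurePosixPath(urlsplit(download_url).path).name
    if not name:
        raise ValueError(f"cannot derive a file name from {download_url!r}")
    # Qualifiers are emitted in alphabetical order: checksum, then download_url.
    purl = (
        f"pkg:generic/{quote(name, safe='')}"
        f"?checksum={algorithm}:{digest}"
        f"&download_url={quote(download_url, safe=':/')}"
    )
    return {
        "type": "file",  # assumed component type
        "name": name,
        "purl": purl,
        "properties": [
            # Records which Cachi2 backend produced the component.
            {"name": "cachi2:found_by", "value": "cachi2:generic"},
        ],
    }
```

The purl type says what the artifact is; the property only says which backend reported it.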

Contributor

Would it be possible to add an example purl that Cachi2 might generate with an input that could produce it?

Contributor Author

Contributor

I guess it is sufficient. I cannot remember why I was asking specifically. This can be resolved.


## Initial thoughts

From a Cachi2 perspective, we can separate purl types into ones that are part of existing package manager ecosystems (such as nuget, composer, maven) and ones that are not (github, huggingface, oci):

Contributor

What benefit do we have from this separation? Do we do anything with this classification?

Contributor Author

We don't do anything with them. I initially had a hard time wrapping my head around non-package manager types (oci, github, huggingface), because they don't simply point to a file at a URL. I initially assumed we don't want to allow this feature to cover other existing package managers, because we'd want full support for a package manager.

I think the question here is how flexible we want this feature to become. It started as a way to fetch files, but do we want to extend it to OCI artifacts or even git repos?

Contributor

Any of these artifact types vary with how metadata is stored and associated with the content that is downloaded.

Being able to specify/support purl types that could match to future package manager support is one way to simplify the required work for fetching content. The downside to this (from a users' perspective) is that the native dependency files for the package managers are not supported. If, in the future, a package manager-native way to resolve dependencies is developed, it would be an option for the Cachi2 tooling to deprecate the generic purl-based fetching mechanism. This deprecation would, of course, require a breaking change.

Contributor

I would consider flexibility to be driven by a case-by-case basis. If there is requested support for fetching some form of artifact, is it required to have the native dependency format (i.e. lockfiles)? Or is it possible/reasonable to implement the fetching with a generic purl-based approach? Implementation with a purl-based approach seems like it would at least be faster if that is an acceptable solution.

Member

> simplify the required work for fetching content

We want to make sure, for anything with an existing package manager, that we don't break builds, i.e. anything which happens after the fetch.

For instance: if we have to handle some arbitrary level of complexity to resolve the dependency graph we need to download, and/or some arbitrarily complex setup step is involved (e.g. a specific filesystem layout)... we're better off simply using the existing package manager ecosystem and its lockfiles to resolve the dependency graph, get URLs, download to correct locations from those URLs, etc...

To a first approximation, cachi2 is a very smart "package manager wrapper" - our secondary brief is to make sure that e.g. go build, pip install -r foo, yarn (er... build? >_<) succeed, and our primary brief is that we control the downloads which the build relies on.

Contributor

I don't think that it is Cachi2's job to ensure that builds are not broken if a purl-driven package manager was used. This should be handled by users' specifications of what content to download ... whether it is by specifying a path to a lockfile or a set of purls. It does, however, seem reasonable that users' expectations are set based on the choice of interface.

For example, you may have a lockfile that only has one tarball: https://github.com/stolostron/image-builder/blob/master/yarn-source/package-lock.json

A user should be able to use the npm package manager to resolve that if they want to maintain the package lock. If they don't want to maintain the package lock, they should be able to just fetch that tarball.

Similarly, if there is not support for a specific package manager in Cachi2 but a user wants to prefetch all content, they should be able to take the complexity to individually specify all resolved packages for their builds. Cachi2 wouldn't know that there is complete resolution of artifacts but that is irrelevant.

If Cachi2 wants to record how dependencies were defined (i.e. if they were resolved with a package manager or defined individually), it seems reasonable that the generated SBOM would have metadata to indicate how the specific dependency was resolved.

docs/design/generic-enhanced-purls.md Outdated Show resolved Hide resolved

### Decision points

- Should the checksums be specified as part of the input purl?

Contributor

What are the Cachi2 requirements for the checksums for other package managers? This should at least be a best practice for Cachi2 users if it isn't a requirement.

Contributor Author

We will require checksums, it is one of the few validations we have in place. The question here is about where the checksum should be in the lockfile. A generic artifact looks like this:

    - download_url: https://huggingface.co/instructlab/granite-7b-lab/resolve/main/model-00001-of-00003.safetensors?download=true
      checksums:
        sha256: 90bffe1884b84d5e255f12ff0ecbd70f2edfc877b68d612dc6fb50638b3ac17c

The artifact with a specific type will have a purl instead of the download_url, so should we drop the extra checksums attribute and make it part of the purl?

Contributor

If the checksum is a simple/standard addition to a purl, it makes sense to require it. The usability of keeping the checksum separate would just be to make it simpler to see/update it.

Would it be reasonable to support both flows? If the checksum is specified separately, Cachi2 would still generate the purl with the checksum to inject into the SBOM.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, CycloneDX components have dedicated fields for checksums (hashes). Perhaps both ways are supported. The CDX maven generator includes ~8 checksums. Probably more than necessary and it doesn't add them to purls, which is probably not a bad idea in such a case.

Contributor Author

I changed the design so that checksums are only consumed as separate attributes and only a single one of them is reported in the output purl. This is mainly to keep consistency with other existing package managers: we always report a single checksum in case of file dependencies.

We can extend this to support checksums in the purl, in case the need comes up.

Contributor

That makes sense. Presumably Cachi2 would verify all checksums that are provided and it would report the "most secure" one, i.e. sha256 over md5.
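
A minimal sketch of that presumed behavior: verify every provided checksum, then report only the strongest one. The preference order and the function name are assumptions for illustration and are not taken from Cachi2.

```python
import hashlib

# Assumed preference order, strongest first; names as hashlib knows them.
PREFERENCE = ("sha512", "sha384", "sha256", "sha1", "md5")


def verify_and_pick(data: bytes, checksums: dict[str, str]) -> tuple[str, str]:
    """Verify every provided checksum, then return the strongest (algorithm, digest)."""
    if not checksums:
        raise ValueError("at least one checksum is required")
    for algorithm, expected in checksums.items():
        if algorithm not in PREFERENCE:
            raise ValueError(f"unsupported checksum algorithm: {algorithm}")
        actual = hashlib.new(algorithm, data).hexdigest()
        if actual != expected.lower():
            raise ValueError(f"{algorithm} mismatch: expected {expected}, got {actual}")
    # The lowest index in PREFERENCE is the strongest provided algorithm.
    best = min(checksums, key=PREFERENCE.index)
    return best, checksums[best].lower()
```

A real fetcher would hash the downloaded file in chunks rather than holding it in memory; the in-memory `bytes` argument just keeps the sketch short.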

### Decision points

- Should the checksums be specified as part of the input purl?
- Should we limit the qualifiers to the types that are strictly available to that type?

Contributor

What do you mean by this? Are there known qualifiers that should be allowed in the taxonomy for the type?

Contributor Author

Yes, the purl specification loosely defines which qualifiers apply to each type. See oci, for example.

There are also global qualifiers that can be applied to any type (check the qualifiers bullet here).

Contributor

I think I would expect the qualifiers to be verified as part of the parser. Cachi2 would need to ensure that the required qualifiers are present and that the optional ones may be present.

Which of the package-specific qualifiers that are directly specified would Cachi2 use? For example, would OCI artifact support be able to handle the repository_url, namespace, version, ... or would the input purl provide those plus a download_url? Or would support for these qualifiers be considered more for "full" package manager support instead of generic?

Contributor Author

@brunoapimentel brunoapimentel Nov 4, 2024

The specification for oci seems to indicate the use of repository_url.

I changed the design to state that we will validate the qualifiers depending on the type; that is probably better than leaving it completely open.
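To make the idea of type-dependent validation concrete, here is a small sketch. The rule table is an illustrative subset: the global qualifier names come from the purl specification, but which qualifiers are treated as required per type is an assumption here, and `validate_qualifiers` is a hypothetical name.

```python
# Qualifiers the purl specification allows on any type.
GLOBAL_QUALIFIERS = frozenset(
    {"repository_url", "download_url", "vcs_url", "file_name", "checksum"}
)

# Illustrative subset; the "required" sets are assumptions, not settled design.
TYPE_RULES = {
    "generic": {"required": {"download_url"}, "optional": set()},
    "maven": {"required": set(), "optional": {"classifier", "type"}},
    "oci": {"required": {"repository_url"}, "optional": {"arch", "tag"}},
}


def validate_qualifiers(purl_type: str, qualifiers: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    rules = TYPE_RULES.get(purl_type)
    if rules is None:
        return [f"unsupported purl type: {purl_type}"]

    provided = set(qualifiers)
    allowed = GLOBAL_QUALIFIERS | rules["required"] | rules["optional"]

    errors = [
        f"missing required qualifier: {name}"
        for name in sorted(rules["required"] - provided)
    ]
    errors += [
        f"unknown qualifier for {purl_type}: {name}"
        for name in sorted(provided - allowed)
    ]
    return errors
```

Returning all problems at once, rather than raising on the first, would let a user fix an input file in a single pass.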

Contributor

I haven't read the content that you have changed yet, but that change makes sense.

docs/design/generic-enhanced-purls.md (outdated)
- Should the checksums be specified as part of the input purl?
- Should we limit the qualifiers to the types that are strictly available to that type?
- Should we allow types that are not files (git repos, OCI artifacts)? Should they be reported as different component types?
- What should our policy be on extending the generic fetcher to other package managers we don't fully support? Would this reduce contributors' willingness to provide full support for a package manager?

Contributor

What would the difference be between fetching some dependency using a supported package manager vs. using the generic artifact method for retrieving content? Is there any difference in the guarantees/fidelity/integrity of the data?

If there are differences, then I feel like I would tend towards failing the use of the generic artifact fetcher in favor of the more specific tool.

Contributor Author

@brunoapimentel brunoapimentel Nov 4, 2024

I was thinking about this. Compared with most package managers that allow file dependencies (e.g. npm, pip), we are still using aiohttp to download the files in the end. The main difference is that the dependency is defined in the expected lockfile for that package manager (package.json, requirements.txt), which gives a little more confidence in the package type. I don't think this gives us any other guarantees, or any greater confidence in fidelity or integrity. Any file downloaded will have its checksum validated, so we're sure we're getting the right content.

By this reasoning, we could probably even allow the generic fetcher to download pip or other supported package manager artifacts, which would only be useful when a user needs a single artifact and does not want to initialize a project in their repo.

Contributor

My comment earlier is also relevant for this thread: #692 (comment)

What you say here makes sense and I think it matches with my mental model of the interface/results.

@brunoapimentel brunoapimentel marked this pull request as draft October 31, 2024 19:06
@brunoapimentel brunoapimentel force-pushed the purl-enhancer-proposal branch 4 times, most recently from 060befa to edac7be Compare November 4, 2024 21:57
}
```

## Alternative: consume options instead of a purl

Contributor Author

@brunoapimentel brunoapimentel Nov 4, 2024

@aloubyansky How do you feel about this alternate proposal? Would it be difficult to generate the lockfiles for PNC/Maven artifacts?

My initial feeling is that having the attributes broken down is clearer and avoids the need to parse the purl according to the spec. The flexibility to define the output purl would still exist, since we can add any attributes needed to generate the purl we want.

@arewm Wdyt?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could work. To make the options complete for Maven, there would also need to be the classifier and the version. That would be 6 to 7 times as many lines per artifact compared to providing just the PURL (which, for Maven at least, is pretty straightforward).
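For context on why those coordinates matter, the sketch below shows how Maven coordinates could be resolved into a download URL using the standard Maven repository layout. The function name `maven_download_url` and the example repository are hypothetical. It also assumes the purl `type` qualifier maps directly to the file extension, which does not hold for every packaging type.

```python
from typing import Optional


def maven_download_url(
    repository_url: str,
    group_id: str,
    artifact_id: str,
    version: str,
    packaging: str = "jar",
    classifier: Optional[str] = None,
) -> str:
    """Build an artifact URL following the standard Maven repository layout."""
    # The group ID becomes a directory path: io.quarkus -> io/quarkus
    group_path = group_id.replace(".", "/")

    # File name: <artifact>-<version>[-<classifier>].<extension>
    file_name = f"{artifact_id}-{version}"
    if classifier:
        file_name += f"-{classifier}"
    file_name += f".{packaging}"

    base = repository_url.rstrip("/")
    return f"{base}/{group_path}/{artifact_id}/{version}/{file_name}"


# Hypothetical coordinates, for illustration only.
print(maven_download_url("https://repo.example.com/ga", "org.example", "demo", "1.0.0"))
```

The same five or six values appear whether the user supplies them as separate options or packed into a single purl, which is the trade-off in verbosity being discussed.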

Contributor

I haven't read the alternative yet, but based on your comment, is it effectively what you described in your comment above: #692 (comment)?

It is a recognition that different purl types will have different field requirements, so you would expose those fields as being required?

It seems like it would be harder to validate the fields, because a user could add content to an attribute that ends up matching some other part of the purl spec (i.e. if control characters are not properly escaped). If you default to consuming the purl itself, you could validate the parsing of the purl and then update only the specific sub-elements that are relevant for cachi2 to control (i.e. uri/hash).
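The escaping concern can be illustrated with a short sketch that assembles a purl from separate attributes and percent-encodes each one, so that characters such as `?`, `=`, or `&` in user input cannot leak into another part of the purl. The function name `build_purl` is hypothetical, and the set of characters left unescaped in qualifier values is a simplification of the purl specification's encoding rules.

```python
from typing import Optional
from urllib.parse import quote


def build_purl(
    purl_type: str,
    name: str,
    namespace: Optional[str] = None,
    version: Optional[str] = None,
    qualifiers: Optional[dict[str, str]] = None,
) -> str:
    """Assemble a purl from separate attributes, escaping each component."""
    parts = [f"pkg:{purl_type.lower()}/"]

    if namespace:
        # Encode each namespace segment on its own so "/" stays a separator.
        segments = (quote(segment, safe="") for segment in namespace.split("/"))
        parts.append("/".join(segments) + "/")

    parts.append(quote(name, safe=""))

    if version:
        parts.append("@" + quote(version, safe=""))

    # Drop empty values and keep qualifiers sorted by key.
    pairs = [
        f"{key}={quote(value, safe=':/')}"
        for key, value in sorted((qualifiers or {}).items())
        if value
    ]
    if pairs:
        parts.append("?" + "&".join(pairs))

    return "".join(parts)
```

For example, a name of `a?b=c` comes out as `a%3Fb%3Dc` rather than introducing a spurious qualifier. The assembled purl would still need to be parsed and validated as a whole afterwards.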

Contributor

To record some parts of a conversation that I had out-of-band:

  • I don't see an issue with this approach, but you should ensure that you are making the decision for the proper reason.
  • With this method, I think it would still be required to validate the entire purl after it is assembled. Therefore, I don't think that this would alleviate the need to parse a purl.
  • This seems to mostly affect user input. Is it a better user experience to provide values based on some set of required/optional keys? Would users be more likely to do this properly than if they were to specify the purl in its entirety?
  • Will the verbosity (i.e. the increased number of lines in the configuration file) be an issue? Or would it be a desired feature?

- purl: pkg:maven/io.quarkus/[email protected]?type=jar&repository_url=https://maven.repository.redhat.com/ga
  checksums:
    sha256: d16bf783cb6670f7f692ad7d6885ab957c63cfc1b9649bc4a3ba1cfbdfd5230c
  target: quarkus.jar

Contributor

Is this target the on-disk location where the fetched artifacts would be put? Would all artifacts need to specify this so that they can be appropriately used by some later stage, or would there be defaults/assumptions for package managers about locations and file names?

@brunoapimentel brunoapimentel force-pushed the purl-enhancer-proposal branch 2 times, most recently from 593deda to 62fb6d2 Compare November 6, 2024 12:25
@brunoapimentel brunoapimentel marked this pull request as ready for review November 6, 2024 12:25
@brunoapimentel brunoapimentel force-pushed the purl-enhancer-proposal branch 3 times, most recently from a5c2632 to 6c2a617 Compare November 12, 2024 03:54

8 participants