Draft: Proposal to enhance the generic artifacts fetcher purls
Signed-off-by: Bruno Pimentel <[email protected]>
brunoapimentel committed Oct 21, 2024
1 parent e64c2eb commit ff79377
Showing 1 changed file with 150 additions and 0 deletions.
150 changes: 150 additions & 0 deletions docs/design/generic-enhanced-purls.md
@@ -0,0 +1,150 @@
# Purl enhancer for the generic artifact fetcher

The generic artifact package manager is being added to Cachi2 as a means for users to introduce files that do not belong to traditional package manager ecosystems (e.g. pip, npm, golang) to their hermetic container builds. Since Cachi2 does not have any extra information about the file that's being fetched, the purls are always reported as [pkg:generic](https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#generic).

There are use cases that would benefit from more accurate purls, though, such as the recent Maven artifacts [proposal]. Considering that the purl specification already identifies several types of packages that don't fit into traditional package managers (e.g. github, docker, huggingface; see the [purl types spec](https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst) for more info), this proposal builds on the fundamentals of the generic fetcher to provide an extensible mechanism that allows Cachi2 to fetch files from specific sources and report them with matching purl types.

## Enhanced purls overview

- Implement an `enhancer` for every supported purl type. An enhancer is essentially a set of rules applied to a generic artifact; if those rules match, the generic purl is replaced with a more specific type.
- The most generic rule seems to be the package's origin. For well-established public registries such as `maven.org` or `huggingface.co`, it is easy to correlate the package's origin with a more specific type.
- Mapping only well-known public registries would rule out private registries. To solve this, users need to be able to extend Cachi2's default configuration with additional source URLs for each purl type.
- More rules might be useful or even necessary when generating specific purl types. For example, Maven has the concept of groupId, artifactId and version (GAV) for each of its artifacts, and these values are fundamental in defining both the download URL and the purl.
- Extend the generic artifacts lockfile specification to introduce a `type` attribute that allows users to hint at which purl type that artifact should have.
- The only types that can be selected are the ones that have their respective `enhancers` implemented. Users should not be able to freely specify the purl type for a generic artifact; the hint is restricted to this specific subset of purl types.
- Any failures to match the hinted `type` will not cause the request to fail. The file will still be fetched, but the purl will revert to `pkg:generic`.
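
The overview above can be sketched as a small registry of enhancers with a fallback to `pkg:generic`. This is a minimal illustration, not Cachi2 code: `Artifact`, `ENHANCERS`, `resolve_purl` and the placeholder `example` type are all hypothetical names, the generic purl is simplified (no checksum qualifier), and treating an unknown `type` hint as a fallback rather than a validation error is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


# Hypothetical model of one lockfile entry; not an existing Cachi2 class.
@dataclass
class Artifact:
    download_url: str
    target: str
    type: Optional[str] = None  # the proposed purl type hint


def generic_purl(artifact: Artifact) -> str:
    # Simplified: the real generic purl also carries a checksum qualifier.
    return f"pkg:generic/{artifact.target}?download_url={artifact.download_url}"


# An enhancer returns a specific purl, or None when its rules do not match.
Enhancer = Callable[[Artifact], Optional[str]]


def example_enhancer(artifact: Artifact) -> Optional[str]:
    # Placeholder rule: only the origin host is checked.
    if urlparse(artifact.download_url).hostname != "registry.example.com":
        return None
    return f"pkg:example/{artifact.target}"


# Only types with an implemented enhancer can be selected.
ENHANCERS: Dict[str, Enhancer] = {"example": example_enhancer}


def resolve_purl(artifact: Artifact) -> str:
    enhancer = ENHANCERS.get(artifact.type) if artifact.type else None
    if enhancer is None:
        # Assumption: no hint, or an unsupported hint, keeps the generic purl.
        return generic_purl(artifact)
    purl = enhancer(artifact)
    # A failed match never fails the request; the purl reverts to generic.
    return purl if purl is not None else generic_purl(artifact)
```

Keeping each enhancer as a plain function that returns `None` on a mismatch makes the "revert to `pkg:generic`" rule a single line in the caller.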

## A practical example

### Input files

**generic_artifacts.yaml**
```yaml
metadata:
  version: '1.0'
artifacts:
  - download_url: https://github.com/containerbuildsystem/cachi2/archive/refs/tags/0.11.0.tar.gz
    target: cachi2_0_11_0.tar.gz
    checksums:
      sha256: fa0d536389db15fb3dabdb3b3d08354f47f765a653178140bfbe1b3de1a6ee76
  - download_url: https://maven.repository.internal.com/ga/io/quarkus/quarkus-core/3.8.5.internal-00004/quarkus-core-3.8.5.internal-00004.jar
    target: quarkus.jar
    type: maven
    checksums:
      sha1: e4ca5fadf89e62fb29d0d008046489b2305295bf
  - download_url: https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/blob/b919e5d07ce15f31ea741f2be99a00a33c3b427b/model-00001-of-00030.safetensors
    target: llama_3.1_1_of_30.safetensors
    type: huggingface
```
**.cachi2-config.yaml**
```yaml
generic-artifact-sources:
  maven:
    # sample internal registry
    - maven.repository.internal.com
    - maven.org
  huggingface:
    - huggingface.co
```
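
As a rough illustration of how the configured sources could drive the origin check, the sketch below tests a download URL's host against the list for a given purl type. The function name `origin_allowed` is hypothetical, the configuration is shown as an already-loaded dict rather than parsed YAML, and the subdomain matching rule is an assumption (the proposal does not say whether `maven.org` should also cover hosts such as `repo1.maven.org`).

```python
from typing import Dict, List
from urllib.parse import urlparse

# Mirrors the sample .cachi2-config.yaml, already loaded into a dict.
SOURCES: Dict[str, List[str]] = {
    "maven": ["maven.repository.internal.com", "maven.org"],
    "huggingface": ["huggingface.co"],
}


def origin_allowed(download_url: str, purl_type: str,
                   sources: Dict[str, List[str]] = SOURCES) -> bool:
    """Return True if the URL's host is a configured source for purl_type."""
    host = urlparse(download_url).hostname
    if host is None:
        return False
    # Assumption: an entry matches itself and any of its subdomains.
    return any(
        host == entry or host.endswith("." + entry)
        for entry in sources.get(purl_type, [])
    )
```

Matching on the parsed hostname, rather than on a substring of the URL, avoids accepting look-alike hosts such as `evilmaven.org`.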
### Cachi2 CLI usage
```
cachi2 fetch-deps --source /path/to/repo generic
```

### Enhancer high-level definition

- MavenPurlEnhancer:
  - validates the origin URL
  - parses the download URL and converts it into the expected purl
    ```bash
    # sample url
    https://maven.repository.internal.com/ga/io/quarkus/quarkus-core/3.8.5.internal-00004/quarkus-core-3.8.5.internal-00004.jar

    # how the parsing will be done
    https://{repository_url}/{as_dir(group_id)}/{artifact_id}/{version}/{artifact_id}-{version}.{extension}

    # resulting purl
    # note that the type will need to be inferred from the extension and potentially additional attributes
    pkg:maven/{group_id}/{artifact_id}@{version}?type={type}&repository_url={repository_url}&checksum={algorithm}:{checksum}
    ```
  - in case of failure, the file will be reported as generic and a warning will be issued


- HuggingFacePurlEnhancer:
  - validates the origin URL
  - parses the download URL and converts it into the expected purl
    ```bash
    # sample url
    https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/blob/b919e5d07ce15f31ea741f2be99a00a33c3b427b/model-00001-of-00030.safetensors
    # parsing the url
    https://{repository_url}/{namespace}/{name}/blob/{commit_hash}/{file_name}
    # resulting purl
    pkg:huggingface/{namespace}/{name}@{commit_hash}?download_url={download_url}
    ```
  - in case of failure, the file will be reported as generic and a warning will be issued
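
The parsing steps of both enhancers can be sketched with the standard library alone. The function names `maven_purl` and `huggingface_purl` are hypothetical, origin validation and the checksum qualifier are left out, Maven classifiers are not handled, and qualifier values are not percent-encoded, mirroring the examples in this document. The Maven sketch treats every leading path segment as part of the group ID, which reproduces the sample SBOM (`ga.io.quarkus`); a repository served under a path prefix would need extra handling. Qualifiers are introduced with `?`, as the purl specification requires.

```python
from typing import Optional
from urllib.parse import urlparse


def maven_purl(download_url: str) -> Optional[str]:
    """Derive a Maven purl from a repository-layout URL, or None on mismatch."""
    parsed = urlparse(download_url)
    parts = [p for p in parsed.path.split("/") if p]
    # Need at least: one group directory, artifact_id, version, filename.
    if len(parts) < 4:
        return None
    *group_dirs, artifact_id, version, filename = parts
    prefix = f"{artifact_id}-{version}."
    if not filename.startswith(prefix):
        return None
    extension = filename[len(prefix):]
    if not extension:
        return None
    group_id = ".".join(group_dirs)
    repository_url = f"{parsed.scheme}://{parsed.netloc}"
    return (
        f"pkg:maven/{group_id}/{artifact_id}@{version}"
        f"?type={extension}&repository_url={repository_url}"
    )


def huggingface_purl(download_url: str) -> Optional[str]:
    """Derive a Hugging Face purl from a blob URL, or None on mismatch."""
    parts = [p for p in urlparse(download_url).path.split("/") if p]
    # Expected layout: {namespace}/{name}/blob/{commit_hash}/{file...}
    if len(parts) < 5 or parts[2] != "blob":
        return None
    namespace, name, _, commit_hash = parts[:4]
    return (
        f"pkg:huggingface/{namespace}/{name}@{commit_hash}"
        f"?download_url={download_url}"
    )
```

A caller would treat a `None` result as the failure case described above: log a warning and report the file as `pkg:generic`.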

### Resulting SBOM
```json
{
  "components": [
    {
      "name": "cachi2_0_11_0.tar.gz",
      "purl": "pkg:generic/cachi2_0_11_0.tar.gz?checksum=sha256:fa0d536389db15fb3dabdb3b3d08354f47f765a653178140bfbe1b3de1a6ee76&download_url=https://github.com/containerbuildsystem/cachi2/archive/refs/tags/0.11.0.tar.gz",
      "properties": [
        {
          "name": "cachi2:found_by",
          "value": "cachi2:generic"
        }
      ],
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://github.com/containerbuildsystem/cachi2/archive/refs/tags/0.11.0.tar.gz"
        }
      ],
      "type": "file"
    },
    {
      "name": "quarkus-core",
      "version": "3.8.5.internal-00004",
      "purl": "pkg:maven/ga.io.quarkus/quarkus-core@3.8.5.internal-00004?type=jar&repository_url=https://maven.repository.internal.com&checksum=sha1:e4ca5fadf89e62fb29d0d008046489b2305295bf",
      "properties": [
        {
          "name": "cachi2:found_by",
          "value": "cachi2:generic"
        }
      ],
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://maven.repository.internal.com/ga/io/quarkus/quarkus-core/3.8.5.internal-00004/quarkus-core-3.8.5.internal-00004.jar"
        }
      ],
      "type": "file"
    },
    {
      "name": "Llama-3.1-Nemotron-70B-Instruct-HF",
      "purl": "pkg:huggingface/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF@b919e5d07ce15f31ea741f2be99a00a33c3b427b?download_url=https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/blob/b919e5d07ce15f31ea741f2be99a00a33c3b427b/model-00001-of-00030.safetensors",
      "properties": [
        {
          "name": "cachi2:found_by",
          "value": "cachi2:generic"
        }
      ],
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/blob/b919e5d07ce15f31ea741f2be99a00a33c3b427b/model-00001-of-00030.safetensors"
        }
      ],
      "type": "file"
    }
  ]
}
```
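
For illustration only, the component entries in the SBOM above could be assembled by a small helper like the one below. `build_component` is a hypothetical name, not an existing Cachi2 function; the field names and the `cachi2:found_by` property are copied from the sample, and `version` is emitted only when an enhancer was able to extract one.

```python
import json
from typing import Optional


def build_component(name: str, purl: str, download_url: str,
                    version: Optional[str] = None) -> dict:
    """Assemble one SBOM component in the shape used by the sample above."""
    component = {"name": name}
    if version is not None:
        # Generic artifacts carry no version; enhanced ones may.
        component["version"] = version
    component.update({
        "purl": purl,
        "properties": [{"name": "cachi2:found_by", "value": "cachi2:generic"}],
        "externalReferences": [{"type": "distribution", "url": download_url}],
        "type": "file",
    })
    return component


# Usage: serialize a single enhanced component.
url = "https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/blob/b919e5d07ce15f31ea741f2be99a00a33c3b427b/model-00001-of-00030.safetensors"
purl = ("pkg:huggingface/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
        "@b919e5d07ce15f31ea741f2be99a00a33c3b427b?download_url=" + url)
sbom = {"components": [build_component("Llama-3.1-Nemotron-70B-Instruct-HF", purl, url)]}
print(json.dumps(sbom, indent=2))
```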
