vendor.tar.gz archives should be idempotent between service runs #40

Open
marcosbc opened this issue Nov 6, 2023 · 3 comments

marcosbc commented Nov 6, 2023

We are observing that tarballs generated by obs-service-go_modules are not idempotent, i.e. the archives are not bit-identical after execution even if there are no changes in the file contents, and therefore their checksums differ. For us, having idempotent archives is useful as a re-execution of the source service with identical file contents would avoid the file being stored again in our repositories.

For example, executing this service twice gives different results even if the file contents are the same:

$ rm -f vendor.tar.gz; ./go_modules --outdir $(pwd); md5sum vendor.tar.gz; ./tarsum <vendor.tar.gz >vendor-orig
(...)
e94b934dafe9d375199543e5391f9571  vendor.tar.gz
$ rm -f vendor.tar.gz; ./go_modules --outdir $(pwd); md5sum vendor.tar.gz; ./tarsum <vendor.tar.gz >vendor-latest
(...)
978de70d571768e27d6b5fb41dfcb971  vendor.tar.gz
# check if the file contents have changed
$ md5sum vendor-orig vendor-latest
ac565e66ded7d78dc409548f42d39a61  vendor-orig
ac565e66ded7d78dc409548f42d39a61  vendor-latest

In this example we used the tarsum script (you can find it here) to calculate the checksum of each individual file inside the archive. As you can see, the result is identical in both cases, so the actual contents of the archive are the same.
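The comparison above can be sketched in Python. This is only an illustration of the idea, not the linked tarsum script: `content_digest` and `build_archive` are hypothetical helper names, and the archives are built in memory rather than by go_modules. Two archives with the same file contents but different entry timestamps differ byte for byte, while a digest over member names and contents stays the same.

```python
import hashlib
import io
import tarfile


def content_digest(archive_bytes):
    """Hash member names and file contents only, ignoring timestamps and ownership."""
    digest = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        # Sort by name so member order does not influence the result.
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            digest.update(member.name.encode("utf-8"))
            digest.update(b"\0")
            if member.isfile():
                data = tar.extractfile(member).read()
                digest.update(hashlib.sha256(data).digest())
    return digest.hexdigest()


def build_archive(files, mtime):
    """Build a small .tar.gz in memory whose entries all carry the given mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    files = {"vendor/modules.txt": b"# example.com/mod v1.0.0\n"}
    first = build_archive(files, mtime=1000)
    second = build_archive(files, mtime=2000)
    print("archives bit-identical:", first == second)
    print("contents identical:", content_digest(first) == content_digest(second))
```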

Note that in other plugins such as obs-service-node_modules, this does not seem to happen since a re-execution generates bit-identical archives.

Collaborator

jfkw commented Nov 6, 2023

Thanks for surfacing this issue. I had noticed that vendor.tar.gz changes on every run and would like to eliminate that if we have the necessary controls to do so. Considerations to investigate:

  • Is this limited to tarballs or all supported archive formats? I'll check, but expect that this is due to the behaviour of go mod vendor, see next item.

    This may be a good time to mention that with the move to use libarchive, we support more archive formats than I think we need. If we had to choose e.g. because special case handling per-format is needed for archive reproducibility, I would prioritize .tar.gz, .obscpio and .zstd.

  • go mod vendor does, at a minimum, write files with new time metadata on each run. We may not be able to do anything about that behaviour. There are no command options to do a dry run and discard the result if there are no changes:

    go help mod vendor
    usage: go mod vendor [-e] [-v] [-o outdir]
    
    Vendor resets the main module's vendor directory to include all packages
    needed to build and test all the main module's packages.
    It does not include test code for vendored packages.
    
    The -v flag causes vendor to print the names of vendored
    modules and packages to standard error.
    
    The -e flag causes vendor to attempt to proceed despite errors
    encountered while loading packages.
    
    The -o flag causes vendor to create the vendor directory at the given
    path instead of "vendor". The go command can only use a vendor directory
    named "vendor" within the module root directory, so this flag is
    primarily useful for other tools.
    
    See https://golang.org/ref/mod#go-mod-vendor for more about 'go mod vendor'.
    

    FWIW, I would be very happy to see upstream go expose the module and vendoring machinery as a library. The functions are there, but the interface is not stable/public at this time.

  • Setting file times under vendor/ to the file times of go.mod is one idea that comes to mind. That would be stable when the source archive is stable.

    This is not entirely straightforward to implement: the most common packaging pattern is to build the source archive via obs_scm / tar_scm. This works well across the git-centric Go ecosystem and the changelog generation is valuable to packagers. Thus the file times of go.mod may also change across repeated service runs. I'll check to confirm this behavior.

  • We don't want to presume that any metadata we check under vendor/, e.g. modules.txt, would be a sufficient reason to reuse a previous version of the contents of vendor/. The correctness of the contents of vendor/ must take the highest priority. I'm willing to take whatever steps we can to normalize/quantize superfluous metadata, e.g. fine-grained file times on inputs to archive creation, if that helps and if we have those controls. I've not previously worked with libarchive to control these aspects of archive reproducibility. If it has applicable controls via function arguments, I'm open to using them to address this issue.

  • A strategy of preserving two directories, vendor-new/ and vendor/, and then comparing them seems like it might have numerous failure modes or correctness errors. I'm open to ideas for strategies that can work reliably.
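The idea of setting file times under vendor/ to those of go.mod could look roughly like the following Python sketch. It is an assumption-laden illustration, not what the service does: `normalize_mtimes` is a hypothetical helper, and it only helps if the mtime of go.mod is itself stable between runs, which the obs_scm / tar_scm caveat above puts in doubt.

```python
import os


def normalize_mtimes(vendor_dir, reference_file):
    """Set atime/mtime of everything under vendor_dir to reference_file's mtime.

    Returns the number of paths touched, including vendor_dir itself.
    """
    # Truncate to whole seconds so sub-second noise cannot leak into archives.
    ref = int(os.stat(reference_file).st_mtime)
    touched = 0
    for root, dirs, files in os.walk(vendor_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # symlinks are left alone in this sketch
            os.utime(path, (ref, ref))
            touched += 1
    os.utime(vendor_dir, (ref, ref))
    return touched + 1


if __name__ == "__main__":
    # Assumes the current directory is a module root after `go mod vendor`.
    print(normalize_mtimes("vendor", "go.mod"), "paths normalized")
```

Normalizing on disk keeps the archiving step unchanged. The alternative is to leave the files alone and override the times while writing archive entries.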

jfkw changed the title from "Tarballs should be idempotent" to "vendor.tar.gz archives should be idempotent between service runs" on Nov 6, 2023
@diconico07

One way to handle that is to set the file times in the archive to the mtime of go.mod (or would it be better to use go.sum here?).
More is needed for gz and obscpio, as these formats embed additional fields that vary between runs:

  • libarchive by default adds a timestamp to gzip headers
  • the cpio newc format has multiple fields that prevent out-of-the-box reproducibility with libarchive (we would need to emulate the --reproducible option of the cpio command)
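For the .tar.gz case, the two points (fixed entry times, no timestamp in the gzip header) can be illustrated with Python's standard library. This is a sketch of the technique only: it uses Python's `tarfile` and `gzip` modules rather than libarchive, and `reproducible_tar_gz` is a hypothetical helper name. The fixed `mtime` argument stands in for the mtime of go.mod or go.sum.

```python
import gzip
import io
import os
import tarfile


def reproducible_tar_gz(src_dir, mtime):
    """Archive src_dir into .tar.gz bytes that depend only on names, modes and contents."""

    def normalize(info):
        # Replace per-run and per-user metadata with fixed values.
        info.mtime = mtime
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    parent = os.path.dirname(os.path.abspath(src_dir))
    entries = [src_dir]
    for root, dirs, files in os.walk(src_dir):
        entries.extend(os.path.join(root, name) for name in dirs + files)

    raw = io.BytesIO()
    # mtime=0 and an empty filename keep the gzip header free of varying fields.
    with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
            # Sorted order makes the member sequence independent of directory listing order.
            for path in sorted(entries):
                arcname = os.path.relpath(path, parent)
                tar.add(path, arcname=arcname, recursive=False, filter=normalize)
    return raw.getvalue()


if __name__ == "__main__":
    import hashlib

    # Assumes a vendor/ directory exists in the current directory.
    data = reproducible_tar_gz("vendor", mtime=0)
    print(hashlib.sha256(data).hexdigest())
```

Running the helper twice on unchanged file contents should give the same bytes even if the files on disk were rewritten with new timestamps in between.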

@diconico07

I created #55 to do this, so it works for at least .tar.gz, .tar.xz and .tar.zst formats.
Cpio is out of scope of this PR as it requires things that are not doable with libarchive as of today.
