Cache key for saving / restoring the link check cache (e.g. for GH Actions) #671
Comments
The cache is so wildly broken right now; I'm working on a new major release (#669).
I don't know much about how GitHub Actions works, but as of #672 the cache logic has been completely rewritten. Would it be enough to have, let's say, an md5 of the cache log file?
@gjtorikian, excited to see the new caching logic in action in the new v4, thanks for the big effort! As for the cache key, we would need to compute it before checking the actual links and use it to restore the cache log file from a previous run. An md5 of the links that will eventually be checked and cached would do. In a sense, a cache log file "skeleton" containing the links but without the check results would be enough, although I am not sure this fits the way the cache is implemented. What would probably work is to construct html-proofer/lib/html_proofer/runner.rb Lines 66 to 79 in 041bc94
A (perhaps stupid) idea, trying not to be too disruptive and not too tailored to the cache key problem: add a new option like
The dumped extracted links could be useful per se, and the file(s) can be easily hashed to construct the key for GitHub Actions via the built-in
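The hashing idea in this comment can be sketched roughly as follows. This is only an illustration in Python, not HTMLProofer code: the function name `links_fingerprint` is hypothetical, and it assumes the extracted links are already available as plain strings (for example, read from a dumped file).

```python
import hashlib


def links_fingerprint(links):
    """Return an md5 hex digest of a collection of links.

    Hypothetical helper: the result depends only on the set of links,
    not on their order or on how often each one appears.
    """
    # Sort and de-duplicate so the same set of links always gives the same key.
    normalized = sorted(set(links))
    payload = "\n".join(normalized).encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

Because the digest is computed from the links alone, it can be produced before any link is checked, which is what a cache key for a restore step needs.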
Ok, thanks for walking me through that. For external links this should be easy: an external link is the same no matter what file it appears in. Internal links are a different monster. While we can just keep track of all the links found, I think the source file name would also need to be preserved. The current cache logic does do this; I'm not sure what hashing/preserving it during CI runs would look like. (To be explicit about this: a link to
So, here's what I'm thinking. The format for the cache file is roughly like this:
{
"version": 2,
"internal": {
"/somewhere.html": {
"time": "2022-01-01 22:46:47 -0500",
"metadata": [
{
"source": "spec/html-proofer/fixtures/cache/internal_and_external_example.html",
"current_path": "spec/html-proofer/fixtures/cache/internal_and_external_example.html",
"line": 11,
"base_url": "",
"found": null
}
]
}
},
"external": {
"https://github.com/gjtorikian/html-proofer": {
"time": "2022-01-01 22:46:47 -0500",
"found": true,
"status_code": 200,
"message": "OK",
"metadata": [
{
"filename": "spec/html-proofer/fixtures/cache/internal_and_external_example.html",
"line": 7
}
]
}
}
}
It seems to me that if I:
This should be enough to preserve a cache between runs. I feel like I'm missing something around timestamps, though. For example, if a CI run occurred in March 2021 and then ran again in March 2022, with no fundamental changes to the contents, the md5 hash would be exactly the same, but the cache should be invalidated, because it's possible the external links changed in that time (maybe the website was removed or whatever).
Then, I think the way I'm storing the timestamps right now is wrong; they should be at the top level, rather than per link, for easier access. If I do that, then the fingerprint can be the timestamp of the last run plus the md5 hash. What do you think? Am I missing anything?
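A rough sketch of the fingerprint proposed here, written in Python rather than Ruby and not taken from HTMLProofer itself. It assumes the cache has the JSON shape shown above; the function names and the `last_run` argument (standing in for the proposed top-level timestamp, which does not exist in the format above) are hypothetical.

```python
import hashlib
import json


def cache_skeleton(cache):
    """Drop check results; keep links and, for internal links, their source files."""
    internal = {
        url: sorted({m["source"] for m in entry.get("metadata", [])})
        for url, entry in cache.get("internal", {}).items()
    }
    # External links are the same no matter which file they appear in.
    external = sorted(cache.get("external", {}))
    return {"internal": internal, "external": external}


def cache_fingerprint(cache, last_run):
    """Combine a last-run timestamp (assumed, top-level) with an md5 of the skeleton."""
    # sort_keys makes the serialization, and therefore the hash, deterministic.
    serialized = json.dumps(cache_skeleton(cache), sort_keys=True)
    digest = hashlib.md5(serialized.encode("utf-8")).hexdigest()
    return f"{last_run}-{digest}"
```

With this shape, a change in a link's status code or per-link time leaves the md5 part untouched, while adding, removing, or moving a link changes it.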
@gjtorikian, you have a fair point about the timestamps... I will have to give it some more thought, but timestamps are indeed an important aspect of ensuring the cache is saved when links expire. Hope to get back to this shortly.
Let me take a few steps back and try to reason about cache keys. TL;DR: Timestamps do indeed play an important role for an ideal primary key, but they cannot be used to construct the key. So we are better off not defining a "smart" key and just suggesting an "aggressive" key, unique to each run, that forces the cache to always be saved.
(Yet another) Summary of GitHub Actions cache behavior
Main target: make sure the HTMLProofer cache is saved when a relevant update of the cache content occurs (which implies defining a different primary key)
I fear we might have to use a conservative approach / workaround and make sure the cache is always saved in each (successful) run
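One way to picture such an "aggressive" key is the sketch below. It is an assumption-laden illustration in Python, not part of HTMLProofer or of any GitHub action: the function name `cache_keys` and the `htmlproofer-cache` prefix are made up, and it relies on `GITHUB_RUN_ID`, an environment variable that GitHub Actions sets for each workflow run.

```python
import os


def cache_keys(prefix="htmlproofer-cache", env=None):
    """Return (primary_key, restore_prefix) for a cache that is saved on every run.

    Hypothetical helper: the primary key is unique per run, so it never matches
    an existing cache entry and a fresh cache is saved each time.
    """
    env = os.environ if env is None else env
    # Fall back to a placeholder when running outside GitHub Actions.
    run_id = env.get("GITHUB_RUN_ID", "local")
    primary_key = f"{prefix}-{run_id}"
    # A shared prefix lets an earlier run's cache be restored as a fallback.
    restore_prefix = f"{prefix}-"
    return primary_key, restore_prefix
```

The commit SHA mentioned in the issue description could be substituted for the run ID in the same way; the trade-off in either case is that a new cache entry is written on every successful run.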
As discussed in #664 (comment), saving / restoring the HTMLProofer cache in CI tools like GitHub Actions suffers from the difficulty of defining a proper key reflecting the content of the cache:
It would be great if HTMLProofer could "return" (perhaps just to standard output for the command-line usage) the links or a computed hash of the links to be checked.
Note that the approach currently mentioned in the README suffers from the issue described in #664. I have been switching to using the commit SHA to work around this.