POC: HTTP reverse proxy cache #3610
Conversation
Thanks a lot for digging into this and documenting all of it! This has really helped me understand the Varnish cache system better already. Below I brainstormed a few questions which I'd like to discuss at a meeting.
Trying to give some (opinionated) early answers.
Difficult to say without measuring real live data. From a user perspective, initial data load after first login will always be a cache miss. Cache hits will be observed for:
However, this is from a user perspective. HTTP caching serves two "customers": the individual user and the server (or the overall application). So, equally important, the HTTP cache protects the server from excessive load.
Decreasing the load on the server in turn provides a faster experience for all the other cache misses. Normally, these two benefits align: a higher hit rate provides a faster user experience and at the same time decreases the load on the server. However, they don't always align (see the next question on response sizes), in which case a balance needs to be found between "best possible speed for each user" and "overall high load throughput". For me personally, the second benefit (server) even outweighs the first (individual user). Having caching functionality in the back of our hands would be a nice "weapon" to have once we decide to go out of beta and open the application to anyone. Minor point: We have some endpoints (like
Doesn't fit very well 😸 Caching and invalidation work better for smaller responses with less embedded data. I can see both options:
The same is true for other endpoints which embed a lot of data. We might decide to keep them as-is and accept that the larger ones become uncacheable. Or we might decide to reduce embedding. Increasing the number of network requests and reducing the embedding of data might be a disadvantage, though, for people on a high-latency network (just an assumption, would need to be tested). So this might be one of the tradeoffs between the user incentive and the server incentive.
By default, Varnish doesn't write any logs to disk, only to memory. So far I used … We should also add some headers in the response which indicate at least what is a cache hit and what is a miss. I've also seen snippets which enable this only if a specific debug flag is set, to avoid providing too much information to outsiders. For metrics: I haven't looked into this yet. There is a Prometheus exporter, which however doesn't seem to have seen much development lately.
My opinion (obviously to be discussed): on all deployments, so we see and test the same thing that is later pushed to production. Locally, I guess we want developers to be able to develop and test both with and without caching. Especially for development of backend functionality, you probably want to develop without the HTTP cache first and only sanity-check at the end that things work properly with caching enabled.
Currently (in this PR), cache expiration is set to 3600s/1h. I guess we could confidently live with much higher values, but this needs to be fine-tuned with real measurement (hit rate, cache memory size, etc.). I don't think users should & would need a cache purge functionality. This is something we need to implement correctly and not something users should understand and worry about. Maybe we want to have an admin functionality to purge the cache for troubleshooting cases. For deployments, my naive approach would be to ensure that Varnish is always restarted or purged with each new deployment.
This goes back to the user vs. server perspective. From a user perspective, the relative gain obviously depends a lot on their network. From South America, I currently see the following timings on the production server:
So a cached version of /camps would probably save me 200ms (approx. 50%) in this specific case. Results for someone with fiber in CH are hopefully different :-) It also depends on the endpoint. On my localhost (without Xdebug), most requests are in the range of 100-200ms, but I think I've also seen requests >1s on deployments during performance testing. Additional latency of the cache proxy: purely on the proxy side, negligible. Varnish is really fast, unless we implement something like preflight requests to the backend. From a server perspective, the performance increase directly correlates with the hit rate we can achieve.
Edit: I added a second example of camp-specific routes, for ScheduleEntry. I figured that the previous example with Category was too simple, because Category is a direct child of Camp. The example for ScheduleEntry is more generic and works with all resources of type …
Relevant commits: … This is also related to api-platform/core#5673.
Edit 2: This becomes a bit cleaner once api-platform/core#5732 is merged & released.
Sounds really cool, also good that you went far enough to find the first potholes. A possible approach would be:
4. Get some monitoring working in production (cache hits, misses, request duration on cache miss, cache purges, resource usage).
Another, maybe naive, approach would be to let the client include the campId in a header, i.e. "I am now navigating in camp xy, give me the cached responses of this camp".
Core Meeting Decision:
- 2 endpoints:
- Create a feature toggle
@pmattmann I addressed the bug we briefly discussed at the end of the meeting (see commit 3f604db), in case you want to have a look at it.
I don't know what improved the tests; I assume ae74c7b.
Looks good. I need a quick refresher on one point. The endpoint … Shouldn't this only be done for referenced link collections?
At the moment, all relations are included in the tag list, just to play it safe so we don't miss any edge case. As you identified correctly, in the case of … This is probably an area for later optimization (= a smaller list of cache tags). But we have to be careful here; we cannot include only collections. Most probably it's safe to remove ManyToOne relations (like …)
I tried to filter out the …
We would still need to figure out how we solve OneToOne relations, though. I'd prefer to keep this as a backup PR at the moment, and take it up as an optimization once we implement caching for the first resource with a OneToOne relation. OK for you?
👏👏👏 nice
This PR implements reverse proxy caching for the following endpoints:
The cache-hash includes the JWT cookie, so the cache is personal for each user (for each login/JWT, to be specific).
The cache is purged automatically for updates/deletes/creates that impact the cached responses. An xkey strategy is used for this, which deviates from the standard api-platform cache tag strategy.
This PR also includes the upgrade to api-platform 3.3. For simplicity of code review, it would make sense to merge #4942 beforehand.
To do:
To do after review / before merging:
Blocked by:
- need PR feat(serializer): collect cache tags using a TagCollector (api-platform/core#5758) and PR fix(serializer): fix TagCollector for JSONAPI and HAL format (api-platform/core#6076)
- compatibility with Symfony 7: Update to Symfony 7 (FriendsOfSymfony/FOSHttpCacheBundle#598)
See below for the original description of this PR.
This is an example POC for an HTTP cache in front of our API.
General
What's the purpose
An HTTP reverse proxy sits in front of the application and caches HTTP responses. Originally, this was mostly used for static content. However, with a smart invalidation mechanism in the application, HTTP caches can also be used for dynamic data.
Cache tags & surrogate keys
Most HTTP caches implement invalidation with surrogate keys (a specific implementation of cache tags).
This recording (Take your Http caching to the next level with xkey & Fastly) is a bit older (2018) but provides a good and simple overview of how cache tags work. The presentation is based on Varnish with xkey (= surrogate keys) and Symfony FOSHttpCache.
Implementation in api-platform/core
api-platform/core already includes an implementation for automatically adding cache tags and invalidating them. Out of the box, it supports Varnish & Souin (although implementing support for any other HTTP cache like Fastly, Cloudflare, etc. would be fairly easy).
In theory, cache tags could be any sort of string. In api-platform, however, they are implemented as IRIs. When cache invalidation is enabled, the default behaviour of api-platform is the following:
- Every response contains an additional HTTP header (cache-tags, xkey, etc., depending on the configuration) which references the IRIs of all entities included in the response (both embedded and linked entities). These tags are collected during the normalization process (in `$context['resources']`) and added to the HTTP response in AddTagsListener.
- An event listener is subscribed to changes on Doctrine entities (updates, inserts, deletions). If any such change is detected, a purge request is sent to the configured HTTP cache. This purge request includes the relevant cache tags (= IRIs) to purge.
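To make the first half concrete, here is a stripped-down sketch of the tagging mechanism. This is not api-platform's actual AddTagsListener; the `_resources` request attribute and the `xkey` header name are assumptions based on the description above and on the configuration used in this PR.

```php
<?php
// Sketch: write the IRIs collected during normalization into a single header
// that the reverse proxy stores alongside the cached response.

use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\HttpKernel\Event\ResponseEvent;
use Symfony\Component\HttpKernel\KernelEvents;

final class AddCacheTagsListener implements EventSubscriberInterface
{
    public static function getSubscribedEvents(): array
    {
        return [KernelEvents::RESPONSE => 'onResponse'];
    }

    public function onResponse(ResponseEvent $event): void
    {
        // Assumption: the IRIs collected in $context['resources'] end up in a
        // request attribute ('_resources' is what api-platform uses internally).
        $iris = (array) $event->getRequest()->attributes->get('_resources', []);
        if ([] === $iris) {
            return;
        }

        // e.g. "xkey: /camps/1a2b3c /camps/1a2b3c/categories /categories/9f8e7d"
        $event->getResponse()->headers->set('xkey', implode(' ', $iris));
    }
}
```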
Souin vs. Varnish
Earlier, Varnish was included in the official api-platform/api-platform template. At the moment, however, neither Varnish nor Souin is included out of the box (the docs are outdated and still mention Varnish as included).
There are open PRs for the integration of both Varnish and Souin (api-platform/api-platform#2383).
From the discussion on api-platform, I got the impression that Souin is simpler and more modern, so I tried Souin first. I was not really happy, though: the documentation is very meager, and I had to look at Souin code and PRs multiple times to figure out how to use it. Finally, I struggled to include our JWT cookies in the cache key, so I gave up on Souin and switched to Varnish.
Hence, this PR includes Varnish. It might be a bit more complex initially, but the documentation is quite good, it is widely used and hence well proven, and the VCL language is really powerful and allows almost anything to be implemented.
This PR
This PR implements a simple setup of Varnish in front of our API.
How to test out
What is implemented
… (`return(pass)`)

Besides the simple setup, the following "advanced" features are implemented preliminarily:
The commit history is clean. So it might make sense to walk through the individual commits to understand the implementation (especially to see the changes made to PurgeHttpCacheListener).
Use cases & examples
Basic caching functionality
Basic invalidation/purge
Invalidation of collection
Cache scoped by JWT
Invalidation via CampCollaboration
Invalidation scoped by camp
(for an explanation of this functionality, read below "Frequent invalidation of collections")
Potential issues
Documentation of issues I ran into or potential issues I can see.
Too many surrogate keys (header too long)
Because each IRI is included in the response cache-tag header, this header can get really large; so large, in fact, that it exceeds the limit Varnish has configured for HTTP headers. On ecamp: try the /activities endpoints. Chances are the request will fail with a 500 due to this issue.
Others have run into this issue as well. In the linked issues, several options to remedy this are discussed.
Besides the options listed in the issue, a straightforward solution is to reduce the number of embedded entities (which could make sense once caching works).
A minimal fail-safe implementation would check the header size and, if it is too large, remove the cache tags and disable caching for this specific response (worst case, a response cannot be cached, but at least it doesn't result in a 500 error).
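A hedged sketch of such a fail-safe, written as a late-running Symfony response listener (names and the limit are illustrative; Varnish's actual limit is governed by its http_resp_hdr_len parameter, 8 kB by default):

```php
<?php
// Sketch: if the collected tag header grows beyond what Varnish accepts,
// drop it and mark the response as uncacheable instead of letting the
// request fail with a 500.

use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\HttpKernel\Event\ResponseEvent;
use Symfony\Component\HttpKernel\KernelEvents;

final class CacheTagHeaderGuard implements EventSubscriberInterface
{
    // stay well below Varnish's default header length limit of 8 kB
    private const MAX_HEADER_LENGTH = 6000;

    public static function getSubscribedEvents(): array
    {
        // negative priority: run after the cache tags have been added
        return [KernelEvents::RESPONSE => ['onResponse', -255]];
    }

    public function onResponse(ResponseEvent $event): void
    {
        $response = $event->getResponse();
        $tags = $response->headers->get('xkey');

        if (null !== $tags && strlen($tags) > self::MAX_HEADER_LENGTH) {
            $response->headers->remove('xkey');
            $response->headers->set('Cache-Control', 'no-store, private');
        }
    }
}
```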
RelatedCollectionLink
The cache tags for an HTTP response are collected during normalization.
Due to the way we had to implement RelatedCollectionLinkNormalizer, too many cache tags are included in the response. This is related to the POC PR #3559, which would solve both the performance issues and the too-many-cache-tags issue.
API platform implementation of cache tags
The implementation in api-platform/core seems to work functionally, but in my opinion it is not optimal: way too many entities are purged during write operations.
The overall strategy of api-platform is to ensure responses are exact (= the cached response and the actual API response always need to match), sacrificing cache hit rate if necessary (see also this comment). This makes a lot of sense; however, I still think we could do better.
As an example, api-platform includes all IRIs in the cache tags, both from embedded and from linked entities. However, the purge behaviour for linked entities could and should be different from that for embedded entities.
I need to dig into this a bit deeper, though, and will check whether I include any improvements in this PR or open a PR directly against api-platform.
Edit: Opened a PR on api-platform for this
Frequent invalidation of collections
Every POST or DELETE operation will purge the collection resource. This is obviously necessary because the collection response now contains one entity more or less. Because query parameters are not part of the cache tags, this purges every variation of the collection endpoint.
Example: a POST on /activities to add a new activity to /camp/1 will purge every cached variation of the /activities collection, not only the ones belonging to camp 1.
For applications like a CMS, where most operations are reads and only a few users edit entities, this might work. For an application like ecamp, however, this would invalidate collection endpoints very frequently, to the extent that the cache hit rate on collection endpoints would be almost 0 once enough users are working and editing on the platform.
As most of our queries are directly or indirectly scoped to a specific camp, one solution could be to include the campID as a mandatory part of the cache tag. There could be various ways to achieve this. The most straightforward variant is to include the campID directly in the URI as a uriVariable (as this is already partially supported by api-platform).
In this PR there's an example implemented for the category endpoint (see the sketch below). Other ideas on how to solve this are highly welcome.
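For illustration, a minimal sketch of what such a camp-scoped route could look like with api-platform 3 attributes (names and options are illustrative and not necessarily the exact code of this PR; Camp is assumed to live in the same namespace):

```php
<?php
// Sketch: expose categories under a camp-scoped URI, so the camp id is part of
// the collection URI and therefore part of the collection's cache tag.

namespace App\Entity;

use ApiPlatform\Metadata\ApiResource;
use ApiPlatform\Metadata\GetCollection;
use ApiPlatform\Metadata\Link;

#[ApiResource(
    uriTemplate: '/camps/{campId}/categories',
    uriVariables: [
        'campId' => new Link(fromClass: Camp::class, toProperty: 'camp'),
    ],
    operations: [new GetCollection()],
)]
class Category
{
    // ...
}
```

A POST of a new Category would then only need to purge /camps/{campId}/categories for the affected camp, instead of every variation of a global /categories collection.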
Response depends on entities other than the ones included in the response body
If a response depends on entities other than the ones included in the normalization process, these entities have to be added to the cache tags manually. In our case, the most prominent example is access control to camp data through CampCollaboration entities. In most responses, the CampCollaboration is not part of the actual response. However, the CampCollaboration entity defines whether I have access to the camp data or not, hence the cache needs to be purged when the CampCollaboration entity changes.
In this PR, this is implemented for both security voters: the entity responsible for granting access to the resources is added to the cache tags (99c3723).
The other place in our code where responses vary depending on other entities is the Doctrine filters in the repositories (most prominently FiltersByCampCollaboration). This is not implemented yet and is closely related to the previous topic of frequent invalidation. Potential solutions:
Further development
Production readiness
This example PR is obviously not production ready. Besides the deployment implementation, the following resources contain some VCL code snippets that are worth reviewing and implementing where they make sense.
JWT parsing
Currently, the JWT cookies are included in the hash key (in the VCL), but no parsing of the JWT happens on the reverse proxy side. However, this could be implemented in Varnish, for example to validate the token's signature and expiry directly at the proxy, or to extract claims for use in the cache hash.
Resources for JWT parsing in Varnish (for my own documentation):
- Blog post, HS256, cookies: https://feryn.eu/blog/validating-json-web-tokens-in-varnish/
- Based on the previous blog post, RS256, Authorization header: https://github.com/opositatest/varnish-jwt
- Improved versions, supporting both HS256 and RS256:
  - https://stackoverflow.com/questions/70607615/varnish-how-to-check-jwt-signature-using-digest-vmod
  - https://code.uplex.de/uplex-varnish/libvmod-crypto
  - https://code.uplex.de/uplex-varnish/libvmod-frozen/tree/master/examples/jwt
Shared cache for a camp
As of this PR, users don't share cache data. However, within a camp, most if not all responses are identical between users who have read access to the camp. Hence, an idea to reduce cache size and increase the hit rate would be to use the campID as a hash key instead of the JWT cookie or userID.
Theoretically, this would be feasible if:
This is definitely not straightforward, so it's more of a "potential further development" at a later stage than part of an initial implementation.
Edit: I just found out that there is a second way to implement this without integrating camp claims into the JWT. FOSHttpCacheBundle has a feature called "User Context". In essence, this is a preflight request from Varnish to the backend/Symfony asking for a hash key. It seems like an elegant solution; however, it only makes sense if this preflight logic is really fast and the hit rate is relatively high. Otherwise we just add latency.
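A sketch of what a context provider for this could look like, assuming FOSHttpCacheBundle's user context feature (the ContextProvider interface and addParameter() are taken from the FOSHttpCache documentation; the camp lookup is hypothetical):

```php
<?php
// Sketch: users who can read the same set of camps get the same context hash
// and therefore share cached responses. Exact wiring depends on the bundle's
// user context configuration.

namespace App\HttpCache;

use FOS\HttpCache\UserContext\ContextProvider;
use FOS\HttpCache\UserContext\UserContext;
use Symfony\Bundle\SecurityBundle\Security;

final class CampMembershipContextProvider implements ContextProvider
{
    public function __construct(private Security $security)
    {
    }

    public function updateUserContext(UserContext $context): void
    {
        $user = $this->security->getUser();

        // Hypothetical helper below: the list of camp ids the user may read.
        $context->addParameter('camps', null === $user ? [] : $this->getCampIdsOfUser($user));
    }

    /** Hypothetical: would look up the user's CampCollaborations. */
    private function getCampIdsOfUser(object $user): array
    {
        return [];
    }
}
```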
FOSHttpCacheBundle
FOSHttpCacheBundle (which integrates FOSHttpCache into Symfony) is a Symfony package that supports adding cache tags to responses and purging tags via reverse proxy APIs (currently supporting Varnish, Symfony Cache, Cloudflare, and others).
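For reference, this is roughly how tagging and purging look when using the bundle directly (method names from the FOSHttpCacheBundle documentation; this PR uses api-platform's own cache tag mechanism instead, so this is not code from the PR):

```php
<?php
// Sketch: tag responses on read, invalidate the tags on write.

use FOS\HttpCache\ResponseTagger;
use FOS\HttpCacheBundle\CacheManager;

final class CategoryCacheExample
{
    public function __construct(
        private ResponseTagger $responseTagger,
        private CacheManager $cacheManager,
    ) {
    }

    public function onRead(): void
    {
        // tag the outgoing response with the IRIs it depends on
        $this->responseTagger->addTags(['/camps/1a2b3c', '/camps/1a2b3c/categories']);
    }

    public function onWrite(): void
    {
        // purge every cached response carrying one of these tags
        $this->cacheManager->invalidateTags(['/camps/1a2b3c/categories'])->flush();
    }
}
```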
This is the package used in the YouTube video linked earlier. The api-platform implementation is not based on FOSHttpCache (I don't know why). But as of today, the current implementation in api-platform is very difficult to extend without touching the actual code of api-platform.
Switching to FOSHttpCache might become necessary if we feel too limited by api-platform itself, which is what api-platform itself suggests (api-platform/core#952).