Feature proposal: improved caching support #1665
Replies: 9 comments 21 replies
-
Hey @kschmid, this seems like a great idea. What kind of caching were you thinking about? I can see how we could implement caching for…
-
Dear @rickturner2001, first off, I am thinking about this: another (minor) issue would be to forward things like the responses, as these contain information on whether a request was served from a cache or directly. All of this is basically web caching. When going for smarter ways of caching at the application level, then probably something like yfinance-cache comes to mind. I am not sure whether the project wants to go as far as incorporating this.
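For context, "served from a cache or directly" is exactly what HTTP caching headers govern. As a rough illustration of the web-caching layer (a hypothetical stdlib-only helper, not yfinance code), a client can decide freshness from the `Cache-Control: max-age` directive:

```python
def is_fresh(age_seconds: float, headers: dict) -> bool:
    """Return True if a cached response is still fresh per Cache-Control max-age."""
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            try:
                return age_seconds <= int(directive[len("max-age="):])
            except ValueError:
                return False
    # No max-age directive: treat as stale so the client revalidates.
    return False


print(is_fresh(30, {"Cache-Control": "max-age=60"}))  # → True (30s-old entry is fresh)
print(is_fresh(90, {"Cache-Control": "max-age=60"}))  # → False (90s-old entry is stale)
```

Libraries like requests-cache handle this (and more, e.g. `ETag` revalidation) internally; the sketch only shows the idea.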
-
We might implement this as a feature. Can discussions be referenced by id, @ValueRaider?
-
@kschmid why not just use the smart_scraping approach proposed in the documentation, as you pointed out? What problem are you having with it?
-
@kschmid I am not too satisfied with the way yahoo_cache is operating, and as far as I can tell, it's not really the solution you were looking for. As for knowing whether a request is being pulled from the cache or not, I have a simple solution:

```python
import yfinance as yf
from requests import Session
from requests_cache import CacheMixin, SQLiteCache
from requests_ratelimiter import LimiterMixin, MemoryQueueBucket
from pyrate_limiter import Duration, RequestRate, Limiter


class CachedLimiterSession(CacheMixin, LimiterMixin, Session):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.is_cached = False

    def get(self, url: str, params=None, **kwargs):
        # get_expiration_from_interval is assumed to be defined elsewhere.
        expiration = get_expiration_from_interval(params['interval'])
        res = super().get(url, params=params,
                          expire_after=expiration, **kwargs)
        # res.from_cache is True if the response was served from the cache
        self.is_cached = res.from_cache
        return res


session = CachedLimiterSession(
    limiter=Limiter(RequestRate(2, Duration.SECOND * 5)),
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache("yfinance.cache"),
    methods_allowed=['GET'],
)


def get_apple(session: CachedLimiterSession):
    yf.download("AAPL", interval="1d", period="1y", session=session)
    print("Is cached:", session.is_cached)


get_apple(session)
get_apple(session)
```

Output:

However, the expiration for the request IS NOT being set properly; in fact it is…
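As an aside, the snippet relies on a `get_expiration_from_interval` helper that is not shown anywhere in the thread. One possible stdlib-only sketch, where the interval-to-lifetime mapping is purely my assumption and should be tuned to taste:

```python
from datetime import timedelta

# Assumed policy (not from the thread): cache intraday data briefly,
# daily data for a full day.
INTERVAL_EXPIRATIONS = {
    "1m": timedelta(minutes=1),
    "5m": timedelta(minutes=5),
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
}


def get_expiration_from_interval(interval: str) -> timedelta:
    # Fall back to a conservative default for unknown intervals.
    return INTERVAL_EXPIRATIONS.get(interval, timedelta(minutes=15))


print(get_expiration_from_interval("1d"))  # → 1 day, 0:00:00
```

requests-cache accepts a `timedelta` (as well as seconds) for per-request `expire_after`, so a helper like this plugs straight into the `get` override above.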
-
Dear Rick, thanks for the help. I was looking into this mostly for debugging purposes, and for that it should be fine.
-
I tried this because I wanted to analyze the caching behavior in more detail. (Note that I did it today, i.e., not on a weekday, which may influence the access behavior.)
I could identify the expiration handling in the `get` method as the culprit. If I modify this:
This works, but I am not sure about other ramifications. Another strange thing I observed: if, instead of downloading the data directly, I create a ticker and use `history`, then the first ticker gives false and the second gives true (`data` is always true).
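When chasing behavior like this (`download` vs. `Ticker.history` reporting different `from_cache` values), it can help to record every cache decision instead of only the last one. A minimal stdlib sketch; the class and method names are mine, not yfinance's, and in practice `record()` would be called from the session's `get` override:

```python
class CacheProbe:
    """Collects (url, from_cache) events so cache behavior can be analyzed later."""

    def __init__(self):
        self.events = []

    def record(self, url: str, from_cache: bool):
        self.events.append((url, from_cache))

    def hit_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(1 for _, hit in self.events if hit) / len(self.events)


probe = CacheProbe()
probe.record("https://example.com/chart/AAPL", False)  # first fetch: network
probe.record("https://example.com/chart/AAPL", True)   # repeat: cache hit
print(probe.hit_rate())  # → 0.5
```

Dumping `probe.events` after a run would make it obvious which of the several underlying requests (metadata vs. price history) is missing the cache.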
-
Hey @kschmid, I hope you're doing great. I have been making some adjustments to the class.

### Class Changes

We have now created this function, which returns a regex with a dynamic URL (note that the URL is the same one used by the yfinance API for price data):

```python
def get_dynamic_range_interval(range: str, interval: str):
    return re.compile(
        rf"https://query2\.finance\.yahoo\.com/v8/finance/chart/([a-zA-Z]+)\?range={range}&interval={interval}&includePrePost=False&events=div%2Csplits%2CcapitalGains"
    )
```

We can then use this function to generate `url_patterns` and assign different caching options based on the range-interval combination:

```python
url_patterns = {
    # You may define calculate_market_open as you please.
    get_dynamic_range_interval("1y", "1d"): calculate_market_open(),
}
```

Also, the class is now a singleton:

```python
class SingletonMeta(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            instance = super().__call__(*args, **kwargs)
            cls._instances[cls] = instance
        return cls._instances[cls]


class CachedLimiterSession(
    CacheMixin, LimiterMixin, Session, metaclass=SingletonMeta
):
    def __init__(self, is_test: bool = False, *args, **kwargs):
        super().__init__(
            *args,
            **kwargs,
            limiter=Limiter(RequestRate(2, Duration.SECOND * 5)),
            bucket_class=MemoryQueueBucket,
            backend=SQLiteCache(
                "yfinance.cache" if not is_test else "yfinance_test.cache"
            ),
            methods_allowed=["GET"],
            # Do not cache anything that is not included in url_patterns.
            # This is effectively the same as {..., "*": DO_NOT_CACHE}.
            expire_after=DO_NOT_CACHE,
            urls_expire_after=url_patterns,
        )
        self.is_caching = True
        self.last_request = None

    def get(self, url: str, params=None, **kwargs):
        if self.is_caching:
            res = super().get(url, params=params, **kwargs)
        else:
            with self.cache_disabled():
                res = super().get(url, params=params, **kwargs)
        self.last_request = res
        return res

    def toggle_caching(self):
        self.is_caching = not self.is_caching
```

### Testing

In case you wish to match my setup, `config.py`:

```python
import pathlib

BASE_DIR = pathlib.Path().resolve()
DATABASE_PATH = BASE_DIR / "yfinance.cache"
TEST_DATABASE_PATH = BASE_DIR / "yfinance_test.cache"
```

And the tests:

```python
import os
import unittest

import yfinance as yf

from config import BASE_DIR
from session import CachedLimiterSession


# AAPL default for testing purposes
def download_ticker_data(
    session: CachedLimiterSession | None = None,
    ticker: str = "AAPL",
):
    yf.download(
        ticker,
        period="1y",
        interval="1d",
        session=session,
    )


def delete_test_cache():
    if os.path.isfile(BASE_DIR / "yfinance_test.cache"):
        os.remove(BASE_DIR / "yfinance_test.cache")


class TestCache(unittest.TestCase):
    def test_global_instance(self):
        session = CachedLimiterSession(is_test=True)
        session2 = CachedLimiterSession(is_test=True)
        self.assertEqual(id(session), id(session2))
        delete_test_cache()

    def test_caching(self):
        session = CachedLimiterSession(is_test=True)
        download_ticker_data(session)
        self.assertEqual(session.last_request.from_cache, False)
        download_ticker_data(session)
        self.assertEqual(session.last_request.from_cache, True)
        delete_test_cache()

    def test_toggle_cache(self):
        session = CachedLimiterSession(is_test=True)
        session.toggle_caching()
        self.assertEqual(session.is_caching, False)
        download_ticker_data(session)
        self.assertEqual(session.last_request.from_cache, False)
        download_ticker_data(session)
        self.assertEqual(session.last_request.from_cache, False)
        delete_test_cache()

    def test_url_patterns(self):
        session = CachedLimiterSession(is_test=True)
        session.get("https://www.google.com")
        self.assertEqual(session.last_request.from_cache, False)
        session.get("https://www.google.com")
        self.assertEqual(session.last_request.from_cache, False)
        delete_test_cache()
```
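One caveat with the singleton approach worth noting: after the first instantiation, constructor arguments on later calls are silently ignored, so a later `CachedLimiterSession(is_test=True)` after a non-test instantiation would hand back the original non-test session. A stripped-down demonstration of the same metaclass with a plain class (`CacheConfig` is just an illustrative stand-in):

```python
class SingletonMeta(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class CacheConfig(metaclass=SingletonMeta):
    def __init__(self, db_path: str = "yfinance.cache"):
        self.db_path = db_path


a = CacheConfig()
b = CacheConfig(db_path="yfinance_test.cache")  # args silently ignored: instance already exists
print(a is b)      # → True
print(b.db_path)   # → yfinance.cache
```

If the test and production sessions ever need to coexist, keying `_instances` on the constructor arguments as well (or adding a `reset()` hook for tests) would avoid this trap.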
-
@rickturner2001 @kschmid …
-
Due to the rate limiting of the Yahoo API, there is certainly a need for caching, which is also exemplified by the smart-scraping proposal, which is itself based on caching.
However, it would be very helpful to be able to control, to some extent, the way the cache is handled:

a) Was a specific request served from the cache or from the network? This is part of the response, but as this happens inside yfinance, it can't be checked by the application program. It would be great if this could be made available to the application, e.g., via a call to a `get_last_caching_state` (or similar) that exposes this part of the response.

b) It would be great if a ticker could change its assigned session in order to bypass the cache altogether. Right now, I can only do something like this by creating a second ticker with a different associated session, but that means the initialization of this second ticker will also bypass the cache, unnecessarily increasing the number of requests.
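To make (a) concrete, here is a rough sketch of what such an accessor could look like. This is entirely hypothetical: `get_last_caching_state` does not exist in yfinance, and `FakeResponse` only stands in for a real cached response object:

```python
class FakeResponse:
    """Stand-in for a requests-cache response carrying a from_cache flag."""

    def __init__(self, from_cache: bool):
        self.from_cache = from_cache


class CacheStateSession:
    """Wraps a 'get' callable and remembers whether the last response was cached."""

    def __init__(self, inner_get):
        self._inner_get = inner_get
        self._last_from_cache = None

    def get(self, url: str, **kwargs):
        res = self._inner_get(url, **kwargs)
        # Plain requests responses lack from_cache, so default to False.
        self._last_from_cache = getattr(res, "from_cache", False)
        return res

    def get_last_caching_state(self):
        # None until the first request has been made.
        return self._last_from_cache


responses = iter([FakeResponse(False), FakeResponse(True)])
session = CacheStateSession(lambda url, **kw: next(responses))
session.get("https://example.com/chart/AAPL")
print(session.get_last_caching_state())  # → False (first request hit the network)
session.get("https://example.com/chart/AAPL")
print(session.get_last_caching_state())  # → True (second was served from cache)
```

For (b), requests-cache's `cache_disabled()` context manager (used in the `toggle_caching` session earlier in the thread) already offers per-call bypass without swapping sessions; the missing piece is a yfinance-level switch to reach it.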