Currently, data is cached by storing the HTML page of the FBref website, which is then used to re-parse the data contained in it. However, this comes at a huge cost in terms of required disk space. In my case, I ended up with a cache of over 20GB, which isn't ideal by any means.
To improve this, we could store the DataFrame with the data obtained from these pages as a CSV file and just load that when needed. The space required would be reduced roughly tenfold.
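Roughly what I have in mind (the helper and file names are just illustrative, not soccerdata's actual internals): parse the page once, then cache only the resulting DataFrame as CSV and read that on subsequent lookups.

```python
import pandas as pd

def cache_table_as_csv(html_path: str, csv_path: str) -> pd.DataFrame:
    """Parse a stored FBref HTML page once and cache its first table as CSV."""
    # read_html returns one DataFrame per <table> element on the page.
    df = pd.read_html(html_path)[0]
    df.to_csv(csv_path, index=False)
    return df

# On later calls, the much smaller CSV is loaded instead of re-parsing the HTML:
# df = pd.read_csv(csv_path)
```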
I follow the "Save First, Parse Later" (SFPL) paradigm and am not really eager to change this. This blog post lists a number of its advantages, FYI.
We could probably save some storage space by discarding irrelevant parts of the HTML page (e.g., <script> tags, etc.) and by storing the pages in a compressed format. However, this is not a high priority for me. My storage requirements are << 500MB.
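To illustrate the idea (just a sketch, not what soccerdata does today): drop the tags that the parser never looks at and gzip what remains before writing it to the cache.

```python
import gzip
from bs4 import BeautifulSoup

def store_stripped(html: str, path: str) -> None:
    """Drop tags irrelevant for parsing and store the page gzip-compressed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()  # removes the tag and its contents in place
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(str(soup))

def load_stripped(path: str) -> str:
    """Read a compressed page back so it can be re-parsed as before."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()
```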
I assume you must have scraped a huge part of the FBref website to end up with 20GB. That is a use case that I (ethically) actually prefer not to support at all.
I think what's written there makes sense in the context of the ongoing development of a scraper, where the parsing of the data obtained from the website is subject to change. In this case, I think an SFPL approach is a bit less justified, as the end user of the library will work with the data obtained by soccerdata and won't be changing how it is parsed.
I guess, though, that if there were a bug in one of the functions responsible for parsing the data, the SFPL approach would still spare re-scraping those pages after a fix. So I do understand your perspective after all.
Would you consider adding a flag to specify how the data should be cached, along with the ability to cache it as CSV?
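As a purely hypothetical sketch of what such a flag could look like (the parameter and function names are made up, not soccerdata's API):

```python
from pathlib import Path
from typing import Optional

import pandas as pd

def load_cached(key: str, cache_dir: Path, cache_format: str = "html") -> Optional[pd.DataFrame]:
    """Return a cached DataFrame, or None on a cache miss."""
    if cache_format == "csv":
        path = cache_dir / f"{key}.csv"
        return pd.read_csv(path) if path.exists() else None
    path = cache_dir / f"{key}.html"
    if path.exists():
        # Re-parse the stored page; read_html returns a list of tables.
        return pd.read_html(path)[0]
    return None
```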