
[FBref] Change how data is cached to reduce the space required #682

Open
lorenzodb1 opened this issue Aug 19, 2024 · 2 comments

@lorenzodb1 (Contributor)

Currently, data is cached by storing the HTML page of the FBref website, which is then re-parsed to extract the data it contains. However, this comes with a huge cost in terms of required space. In my case, I ended up with a cache of over 20GB, which isn't ideal by any means.

To improve this, we could store the DataFrame obtained from these pages as a CSV file and just load that when needed. This would reduce the required space roughly tenfold.
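As a minimal sketch of the idea, assuming pandas (the function and cache location here are hypothetical illustrations, not soccerdata's actual API):

```python
import hashlib
from pathlib import Path

import pandas as pd

# Hypothetical cache location, not soccerdata's real one.
CACHE_DIR = Path("~/.cache/soccerdata-csv").expanduser()


def cached_read(url: str, parse_fn) -> pd.DataFrame:
    """Return the parsed DataFrame for a URL, caching it as CSV.

    `parse_fn` stands in for whatever downloads and parses the page;
    on a cache hit we skip it entirely and just load the CSV.
    """
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.csv"
    if cache_file.exists():
        return pd.read_csv(cache_file)
    df = parse_fn(url)
    df.to_csv(cache_file, index=False)
    return df
```

The trade-off is that a bug fix in the parser can no longer be applied retroactively to cached pages, which is the crux of the discussion below.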

@probberechts (Owner)

Thanks for your suggestion.

I follow the "Save First, Parse Later" paradigm and I am not really eager to change this. This blog post lists a number of advantages, fyi.

We could probably save some storage space by discarding irrelevant parts of the HTML page (e.g., `<script>` tags, etc.) and by storing the pages in a compressed format. However, this is not a high priority for me. My storage requirements are well below 500MB.
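A rough sketch of that space-saving idea, using only the standard library (the regex-based stripping is approximate and the function names are made up for illustration):

```python
import gzip
import re


def shrink_html(html: str) -> bytes:
    """Drop <script>/<style> blocks and gzip-compress what remains.

    The tables FBref pages embed survive intact, so the page can still
    be re-parsed later in the Save-First-Parse-Later spirit.
    """
    stripped = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                      flags=re.S | re.I)
    return gzip.compress(stripped.encode("utf-8"))


def restore_html(blob: bytes) -> str:
    """Inverse of shrink_html (minus the discarded elements)."""
    return gzip.decompress(blob).decode("utf-8")
```

Since HTML tables are highly repetitive, gzip alone typically shrinks such pages by an order of magnitude, without giving up the ability to re-parse.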

I assume you must have scraped a huge part of the FBref website to end up with 20GB. That is a use case that I (ethically) actually prefer not to support at all.

@probberechts probberechts added enhancement New feature or request FBref Issue or pull request related to the FBref scraper labels Aug 20, 2024
@lorenzodb1 (Contributor, Author)

Thank you for sharing that blog post!

I think what's written there makes sense in the context of a scraper's ongoing development, where the parsing of the data obtained from the website is subject to change. Here, though, the SFPL approach seems a bit less justified: the end user of the library works with the data soccerdata produces and won't be changing how it is parsed.

That said, if there were a bug in one of the functions responsible for parsing the data, the SFPL approach would still save those network calls, so I do understand your perspective after all.

Would you consider adding a flag to specify how the data should be cached, along with the ability to cache it as CSV?
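For concreteness, the requested flag could look something like this (entirely hypothetical; neither the enum nor the helper exists in soccerdata):

```python
from enum import Enum
from pathlib import Path


class CacheFormat(Enum):
    """How scraped pages are persisted (hypothetical option)."""
    HTML = "html"  # raw page, re-parsed on every cache hit (current behaviour)
    CSV = "csv"    # parsed table, loaded directly


def cache_file(cache_dir: Path, key: str, fmt: CacheFormat) -> Path:
    """Resolve the on-disk location of a cached item for the chosen format."""
    return cache_dir / f"{key}.{fmt.value}"
```

A reader-side dispatch on the flag would then decide between re-parsing the stored HTML and loading the CSV directly.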
