
[FBref] Change how data is cached to reduce the space required #682

Open
lorenzodb1 opened this issue Aug 19, 2024 · 2 comments

@lorenzodb1 (Contributor)

Currently, data is cached by storing the HTML page of the FBref website, which is then re-parsed to extract the data it contains. However, this comes with a huge cost in terms of required space. In my case, I ended up with a cache of over 20GB, which isn't ideal by any means.

To improve this, we could store the DataFrame obtained from these pages as a CSV file and just load that when needed. This would reduce the required space roughly tenfold.
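As a minimal sketch of the idea, assuming pandas (the function and cache location here are hypothetical illustrations, not soccerdata's actual API):

```python
import hashlib
from pathlib import Path

import pandas as pd

# Hypothetical cache location, not soccerdata's real one.
CACHE_DIR = Path("~/.cache/soccerdata-csv").expanduser()


def cached_read(url: str, parse_fn) -> pd.DataFrame:
    """Return the parsed DataFrame for a URL, caching it as CSV.

    `parse_fn` stands in for whatever downloads and parses the page;
    on a cache hit we skip it entirely and just load the CSV.
    """
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.csv"
    if cache_file.exists():
        return pd.read_csv(cache_file)
    df = parse_fn(url)
    df.to_csv(cache_file, index=False)
    return df
```

The trade-off is that a bug fix in the parser can no longer be applied retroactively to cached pages, which is the crux of the discussion below.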

@probberechts (Owner)

Thanks for your suggestion.

I follow the "Save First, Parse Later" paradigm and I am not really eager to change this. This blog post lists a number of advantages, fyi.

We could probably save some storage space by discarding irrelevant parts of the HTML page (e.g., `<script>` tags, etc.) and by storing the pages in a compressed format. However, this is not a high priority for me. My storage requirements are well below 500MB.
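A rough sketch of that space-saving idea, using only the standard library (the regex-based stripping is approximate and the function names are made up for illustration):

```python
import gzip
import re


def shrink_html(html: str) -> bytes:
    """Drop <script>/<style> blocks and gzip-compress what remains.

    The tables FBref pages embed survive intact, so the page can still
    be re-parsed later in the Save-First-Parse-Later spirit.
    """
    stripped = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                      flags=re.S | re.I)
    return gzip.compress(stripped.encode("utf-8"))


def restore_html(blob: bytes) -> str:
    """Inverse of shrink_html (minus the discarded elements)."""
    return gzip.decompress(blob).decode("utf-8")
```

Since HTML tables are highly repetitive, gzip alone typically shrinks such pages by an order of magnitude, without giving up the ability to re-parse.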

I assume you must have scraped a huge part of the FBref website to end up with 20GB. That is a use case that I (ethically) actually prefer not to support at all.

@probberechts probberechts added enhancement New feature or request FBref Issue or pull request related to the FBref scraper labels Aug 20, 2024
@lorenzodb1 (Contributor, Author)

Thank you for sharing that blog post!

I think what's written there makes sense in the context of a scraper's ongoing development, where the parsing of the data obtained from the website is subject to change. Here, though, the SFPL approach seems a bit less justified: the end user of the library works with the data soccerdata produces and won't be changing how it is parsed.

That said, if there were a bug in one of the functions responsible for parsing the data, the SFPL approach would still save those network calls, so I do understand your perspective after all.

Would you consider adding a flag to specify how the data should be cached, along with the ability to cache it as CSV?
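For concreteness, the requested flag could look something like this (entirely hypothetical; neither the enum nor the helper exists in soccerdata):

```python
from enum import Enum
from pathlib import Path


class CacheFormat(Enum):
    """How scraped pages are persisted (hypothetical option)."""
    HTML = "html"  # raw page, re-parsed on every cache hit (current behaviour)
    CSV = "csv"    # parsed table, loaded directly


def cache_file(cache_dir: Path, key: str, fmt: CacheFormat) -> Path:
    """Resolve the on-disk location of a cached item for the chosen format."""
    return cache_dir / f"{key}.{fmt.value}"
```

A reader-side dispatch on the flag would then decide between re-parsing the stored HTML and loading the CSV directly.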
