Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch only the required subtree of the website instead of the whole site #102

Open
wants to merge 1 commit into
base: flexbooks
Choose a base branch
from

Conversation

benoit74
Copy link
Contributor

Changes:

  • instead of fetching the whole website tree at scraper start, we can fetch only the required subtree
    • for this we need to go up to the book cover page if we are inside a book
    • and this means that we need to be able to search for cover page without yet having LibraryPage objects, but only the page id as a string
  • instead of raising error when searching for cover page and we are already "above" the book level, return None for more obvious handling

@benoit74 benoit74 added this to the 0.1 milestone Nov 27, 2024
@benoit74 benoit74 self-assigned this Nov 27, 2024
Copy link

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 18.96552% with 47 lines in your changes missing coverage. Please review.

Project coverage is 43.66%. Comparing base (d2a5474) to head (d665d49).

Files with missing lines Patch % Lines
scraper/src/mindtouch2zim/client.py 15.55% 38 Missing ⚠️
...src/mindtouch2zim/libretexts/detailed_licensing.py 25.00% 3 Missing ⚠️
scraper/src/mindtouch2zim/libretexts/index.py 25.00% 3 Missing ⚠️
scraper/src/mindtouch2zim/processor.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           flexbooks     #102      +/-   ##
=============================================
- Coverage      44.55%   43.66%   -0.90%     
=============================================
  Files             18       19       +1     
  Lines           1102     1136      +34     
  Branches         154      163       +9     
=============================================
+ Hits             491      496       +5     
- Misses           599      628      +29     
  Partials          12       12              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 marked this pull request as ready for review November 27, 2024 10:40
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ; see name change suggestion

@@ -0,0 +1,4 @@
class BadBookPageError(Exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BadBookPage implies that the Book page as an issue. In this case, the problem is more with the request if I understand correctly. If so, *IncorectBookPage` seems more appropriate.

if raw_tags is None:
raise MindtouchParsingError(f"No tags property for page {page_id}")
raw_tag = raw_tags.get("tag", None)
if raw_tag is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cant we combine those two conditions into one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants