xuexi

Requirements

  • The main requirement is to scrape the contents of all articles of a certain column (in practice, of its sub-columns) and save them as human-readable documents.

  • Possible future requirements include scraping other (similar or dissimilar) columns, updating the scraped contents as new articles are posted every day, and converting the scraped contents to other formats.

Design

  1. Scrape from the column index the addresses of all sub-column indexes
  2. Scrape from the sub-column indexes the addresses of all articles (article lists may span multiple pages)
  3. Scrape the article contents
  4. Save the article metadata (URL, title, time, source)
  5. Save the article contents as documents (see the sketch below)
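
A minimal sketch of these five steps, assuming the pages are plain HTML fetched with requests and parsed with BeautifulSoup. The CSS selectors (a.sub-column, a.article, .publish-time, .source) and the output file layout are placeholders rather than the site's actual markup, and pagination of the article lists (step 2) is left out.

```python
# Sketch only: selectors and output layout are assumptions, not the real site markup.
import json
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def fetch_links(index_url: str, selector: str) -> list[str]:
    """Fetch an index page and return the href of every element matched by selector."""
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select(selector) if a.get("href")]


def scrape_article(article_url: str) -> dict:
    """Scrape one article page into metadata plus body text (selectors are hypothetical)."""
    html = requests.get(article_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": article_url,
        "title": soup.select_one("h1").get_text(strip=True),
        "time": soup.select_one(".publish-time").get_text(strip=True),
        "source": soup.select_one(".source").get_text(strip=True),
        "content": "\n\n".join(p.get_text(strip=True) for p in soup.select("article p")),
    }


def save_article(article: dict, out_dir: Path) -> None:
    """Save metadata (url, title, time, source) as JSON and the body as a text document."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = article["title"].replace("/", "_")  # crude filename sanitization for the sketch
    metadata = {k: article[k] for k in ("url", "title", "time", "source")}
    (out_dir / f"{stem}.json").write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    (out_dir / f"{stem}.txt").write_text(article["content"], encoding="utf-8")


def run(column_index_url: str, out_dir: Path) -> None:
    # Steps 1-2: column index -> sub-column indexes -> article URLs.
    # Pagination of the article lists is omitted in this sketch.
    for sub_column_url in fetch_links(column_index_url, "a.sub-column"):
        for article_url in fetch_links(sub_column_url, "a.article"):
            # Steps 3-5: scrape each article and persist metadata + contents.
            save_article(scrape_article(article_url), out_dir)
```

Keeping fetching, scraping, and saving as separate functions localizes the possible future requirements: re-scraping for updates only touches run, and converting to other formats only touches save_article.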

Samples

Site architecture of xuexi

  • All pages follow the same URL form: ${base-url}/${page-id}/${template-id}.html
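
Since the URL form is the only documented structure, a small helper for building and splitting such URLs can be sketched as below; the base URL and the example ids are assumptions, not values taken from the site.

```python
# Sketch assuming the ${base-url}/${page-id}/${template-id}.html form; ids below are made up.
from urllib.parse import urlparse


def page_url(base_url: str, page_id: str, template_id: str) -> str:
    """Build a page URL following the ${base-url}/${page-id}/${template-id}.html form."""
    return f"{base_url.rstrip('/')}/{page_id}/{template_id}.html"


def parse_page_url(url: str) -> tuple[str, str]:
    """Split a page URL back into its (page-id, template-id) components."""
    page_id, template_file = urlparse(url).path.strip("/").split("/")[-2:]
    return page_id, template_file.removesuffix(".html")


# Example with hypothetical ids:
#   page_url("https://www.xuexi.cn", "1234", "index")
#   -> "https://www.xuexi.cn/1234/index.html"
```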

Implementation

References
