- The main requirement is to scrape the contents of all articles of a certain column (in fact, of its sub-columns) and save them as human-readable documents.
- Possible future requirements include scraping other similar or dissimilar columns, updating the scraped contents (new articles are posted every day), converting the scraped contents to other formats, etc.
- Scrape from the column index the addresses of all sub-column indexes
- Scrape from the sub-column indexes the addresses of all articles (an article list may span multiple pages)
- Scrape article contents
- Save article metadata (URL, title, time, source)
- Save article contents as documents
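The steps above can be sketched as a small link-extraction stage. The markup of the real index pages is not specified here, so the sample HTML and the idea of collecting every `<a href>` are assumptions; real pages may also render their article lists with JavaScript, in which case an HTML parser alone would not suffice.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href attributes of <a> tags; a stand-in for the real
    index-page parsers, whose selectors depend on the actual markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str) -> list:
    """Return every hyperlink address found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical sub-column index fragment, for illustration only.
sample = (
    '<ul><li><a href="/a1/t1.html">One</a></li>'
    '<li><a href="/a2/t1.html">Two</a></li></ul>'
)
print(extract_links(sample))  # -> ['/a1/t1.html', '/a2/t1.html']
```

The same extractor could serve both index levels (column index and sub-column indexes), with the caller deciding how to interpret the collected addresses.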
- Column index
- Sub-column index
- Article
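One way to model the three page types above, together with the metadata fields from the requirements (URL, title, time, source); the enum values and field types are assumptions, not part of the original spec:

```python
from dataclasses import dataclass
from enum import Enum


class PageKind(Enum):
    """The three page types the scraper must distinguish."""
    COLUMN_INDEX = "column-index"
    SUB_COLUMN_INDEX = "sub-column-index"
    ARTICLE = "article"


@dataclass
class ArticleMeta:
    """Metadata to save per article; fields taken from the requirements.
    All values here are hypothetical placeholders."""
    url: str
    title: str
    time: str
    source: str


meta = ArticleMeta(
    url="https://www.xuexi.cn/abc123/def456.html",  # hypothetical ids
    title="Example title",
    time="2020-01-01",
    source="example source",
)
print(PageKind.ARTICLE.value)
print(meta.title)
```

Keeping the metadata in a plain dataclass makes it easy to serialize later (e.g. to JSON), which lines up with the future requirement of converting scraped contents to other formats.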
- All pages share the same URL form: ${base-url}/${page-id}/${template-id}.html
- ${base-url} = https://www.xuexi.cn/
- The column index
- Misc
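Given the URL form above, a page address can be assembled and decomposed as follows; the page-id and template-id values used in the demo are hypothetical placeholders, not real identifiers:

```python
import re

BASE_URL = "https://www.xuexi.cn"


def page_url(page_id: str, template_id: str) -> str:
    """Build ${base-url}/${page-id}/${template-id}.html."""
    return f"{BASE_URL}/{page_id}/{template_id}.html"


# Inverse of page_url: recover (page-id, template-id) from an address.
URL_RE = re.compile(r"^https://www\.xuexi\.cn/([^/]+)/([^/]+)\.html$")


def parse_page_url(url: str):
    """Return (page-id, template-id), or None if the URL does not match."""
    m = URL_RE.match(url)
    return m.groups() if m else None


u = page_url("abc123", "tpl001")  # hypothetical ids
print(u)                  # https://www.xuexi.cn/abc123/tpl001.html
print(parse_page_url(u))  # ('abc123', 'tpl001')
```

Having both directions (build and parse) is handy when deduplicating scraped addresses or when deciding which page type an address belongs to.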