Substack Scraper

This tool scrapes the full post content of Substack blogs and writes it to raw text files. It was originally built to create neural network training data.

NOTE: This project currently cannot bypass the paywall on subscriber-only Substack articles; for those, it outputs the truncated article text along with the subscriber message.

Usage

git clone https://github.com/ivyraine/substack_scraper
cd substack_scraper
cargo run -- -w <BLOGS>

Example:

cargo run -- -w "https://substack.thewebscraping.club/ https://etiennefd.substack.com/"
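
The example above passes several blogs to `-w` as one quoted, space-separated string. As a rough illustration of how such a value could be broken into individual blog URLs, here is a minimal sketch. The helper name `split_blog_urls` is hypothetical, and both the whitespace splitting and the trailing-slash trimming are assumptions based only on the example invocation; the actual parsing in `src` may differ.

```rust
/// Hypothetical helper (not taken from this project's source):
/// split the value passed to `-w` into individual blog URLs.
/// Assumption: blogs are separated by whitespace, as in the example.
fn split_blog_urls(arg: &str) -> Vec<String> {
    arg.split_whitespace()
        // Drop trailing slashes so both spellings of a URL look the same.
        .map(|url| url.trim_end_matches('/').to_string())
        .collect()
}
```

Under these assumptions, the string from the example would yield two entries, one per blog, and an empty or whitespace-only argument would yield none.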

For debug messages, set the RUST_LOG environment variable to debug.

Contributing

Feel free to open an issue or PR if you have any suggestions or improvements, but I cannot guarantee that I'll get to them! The project is small and has some documentation, so I would encourage putting up a PR if you have a feature you want to add.
