tools: web scraper #210

Merged 9 commits on Aug 23, 2023

Conversation

Struki84
Contributor

@Struki84 Struki84 commented Jul 21, 2023

Added a simple web scraper tool. You can set the scraping depth when creating the tool; the default is 1. The tool scrapes every link it finds on the same domain and does not follow external links. Depending on the size of the website, this might take a while.

It returns a string containing all the content it found, the metadata, a list of links, and the list of scraped links. It complies with robots.txt.
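The crawl behavior described above (a depth limit, same-domain links only) can be sketched roughly as follows. This is not the code from this PR: `crawl` and `fetchLinks` are hypothetical names, `fetchLinks` stands in for "download a page and extract its links", robots.txt handling is left out, and it assumes that a depth of 1 means only the starting page is scraped.

```go
package main

import (
	"fmt"
	"net/url"
)

// crawl walks pages breadth-first from start, visiting at most maxDepth
// levels (depth 1 = the start page only, an assumption) and never leaving
// the host of start. fetchLinks is a hypothetical stand-in for fetching a
// page and returning the hrefs found on it.
func crawl(start string, maxDepth int, fetchLinks func(page string) []string) ([]string, error) {
	root, err := url.Parse(start)
	if err != nil {
		return nil, err
	}
	visited := map[string]bool{start: true}
	var scraped []string
	frontier := []string{start}

	for depth := 1; depth <= maxDepth && len(frontier) > 0; depth++ {
		var next []string
		for _, page := range frontier {
			scraped = append(scraped, page)
			base, err := url.Parse(page)
			if err != nil {
				continue
			}
			for _, href := range fetchLinks(page) {
				ref, err := url.Parse(href)
				if err != nil {
					continue // skip malformed links
				}
				abs := base.ResolveReference(ref) // make relative links absolute
				abs.Fragment = ""                 // "#top" does not name a new page
				if abs.Host != root.Host {
					continue // do not follow outside links
				}
				if key := abs.String(); !visited[key] {
					visited[key] = true
					next = append(next, key)
				}
			}
		}
		frontier = next
	}
	return scraped, nil
}

func main() {
	// An in-memory "website" so the sketch runs without network access.
	site := map[string][]string{
		"https://example.com/":  {"/a", "https://other.org/x", "/b#top"},
		"https://example.com/a": {"/c", "/"},
	}
	pages, err := crawl("https://example.com/", 2, func(p string) []string { return site[p] })
	if err != nil {
		panic(err)
	}
	fmt.Println(pages)
}
```

With a depth of 2 the sketch visits the start page plus `/a` and `/b`, and skips `other.org`; raising the depth to 3 would also reach `/c`.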

PR Checklist

  • Read the Contributing documentation.
  • Read the Code of conduct documentation.
  • Name your Pull Request title clearly, concisely, and prefixed with the name of the primarily affected package you changed according to Good commit messages (such as memory: add interfaces for X, Y or util: add whizzbang helpers).
  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. Fixes #123).
  • Describes the source of new concepts.
  • References existing implementations as appropriate.
  • Contains test coverage for new functions.
  • Passes all golangci-lint checks.

@Struki84 Struki84 changed the title from "tools/web scraper" to "tools: web scraper" on Jul 21, 2023
@Struki84
Contributor Author

@tmc @FluffyKebab any feedback on this?

@tmc
Owner

tmc commented Aug 11, 2023

Can you make the options configurable (not just depth)?

@Struki84
Contributor Author

@tmc is this better?

@Struki84
Contributor Author

Struki84 commented Aug 15, 2023

@tmc bump ^^

Owner

@tmc tmc left a comment

LGTM

@tmc tmc merged commit 43b988b into tmc:main Aug 23, 2023
3 checks passed
2 participants