A configurable web crawler built with Scrapy and Playwright that handles both static and dynamic content. The crawler processes three types of URLs and stores its results in a PostgreSQL database.
## Features

- 🔍 Three types of URL processing (see the config and spider sketches after this list):
  - Type 0: the URL is itself the target and is processed directly
  - Type 1: a static page is scanned for target URLs
  - Type 2: a dynamic (JavaScript-rendered) page is scanned, following links to a configured depth
- 🎭 Playwright integration for JavaScript-rendered content
- 📊 PostgreSQL storage for crawled URLs and crawl statistics (a pipeline sketch follows this list)
- 🔧 YAML-based configuration
- 📝 Structured logging with Logfire
- 🐳 Docker support
- ☁️ Azure deployment ready with Terraform
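
The repository's actual configuration schema isn't shown here, so the following is a hypothetical sketch of what a YAML config covering the three URL types might look like. All field names (`database`, `urls`, `type`, `max_depth`) are assumptions for illustration, not the project's real schema:

```yaml
# Hypothetical config sketch -- field names are assumptions, not the real schema.
database:
  dsn: postgresql://crawler:secret@localhost:5432/crawler

urls:
  - url: https://example.com/report.pdf
    type: 0            # direct target: process the URL itself
  - url: https://example.com/downloads
    type: 1            # static page: scan the HTML for target URLs
  - url: https://example.com/app
    type: 2            # dynamic page: render with Playwright, follow links
    max_depth: 2       # how many navigation levels to follow
```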
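
At the request level, scrapy-playwright renders pages when a request carries the `playwright` meta key, so only type-2 URLs need to pay the rendering cost. Below is a minimal sketch of how dispatch on URL type might look; the `url_type` entries, callback names, and `depth_left` key are illustrative assumptions, not the project's actual code:

```python
import scrapy


class ConfigurableSpider(scrapy.Spider):
    """Sketch of a spider dispatching on per-URL type (names are assumptions)."""

    name = "configurable"

    def start_requests(self):
        # In a real run these entries would come from the YAML config.
        entries = [
            {"url": "https://example.com/app", "type": 2, "max_depth": 2},
        ]
        for entry in entries:
            if entry["type"] == 2:
                # Type 2: ask scrapy-playwright to render the page first.
                yield scrapy.Request(
                    entry["url"],
                    callback=self.parse_dynamic,
                    meta={"playwright": True, "depth_left": entry["max_depth"]},
                )
            else:
                # Types 0 and 1 can use plain, non-rendered requests.
                yield scrapy.Request(entry["url"], callback=self.parse_static)

    def parse_dynamic(self, response):
        yield {"url": response.url, "status": response.status}
        # Follow links until the configured depth is exhausted.
        depth_left = response.meta["depth_left"]
        if depth_left > 0:
            for href in response.css("a::attr(href)").getall():
                yield response.follow(
                    href,
                    callback=self.parse_dynamic,
                    meta={"playwright": True, "depth_left": depth_left - 1},
                )

    def parse_static(self, response):
        yield {"url": response.url, "status": response.status}
```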
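Persistence can be wired in as an ordinary Scrapy item pipeline. Here is a minimal sketch using psycopg2; the DSN, table name, and columns are assumptions for illustration:

```python
import psycopg2


class PostgresPipeline:
    """Sketch of an item pipeline storing crawled URLs (schema is an assumption)."""

    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            "postgresql://crawler:secret@localhost:5432/crawler"
        )
        self.cur = self.conn.cursor()
        self.cur.execute(
            """
            CREATE TABLE IF NOT EXISTS crawled_urls (
                url TEXT PRIMARY KEY,
                status INTEGER,
                crawled_at TIMESTAMPTZ DEFAULT now()
            )
            """
        )
        self.conn.commit()

    def process_item(self, item, spider):
        # Upsert-style insert: skip URLs we have already recorded.
        self.cur.execute(
            "INSERT INTO crawled_urls (url, status) VALUES (%s, %s) "
            "ON CONFLICT (url) DO NOTHING",
            (item["url"], item["status"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```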
## Prerequisites

- Python 3.11+
- PostgreSQL database
- [uv](https://github.com/astral-sh/uv) for package management
- Docker (optional)
## Contributing

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request