This is a project demonstrating how standard Python libraries such as os, urllib, and HTMLParser can be used to build a minimalist web crawler that walks the pages of a website and gathers the hyperlinks (URLs) it finds. The project consists of the following files:
- general.py
- link_finder.py
- domain.py
- spider.py
- main.py
Step 1: Make the necessary changes
Navigate to the file main.py and change the following lines (a sketch of the edited section follows this list):
- To create a new project:
  PROJECT_NAME = 'TEST'
  Change the project name from TEST to YOUR_PROJECT_NAME.
- To add the homepage URL:
  HOMEPAGE = 'https://example.com/'
  Change the link from https://example.com/ to TARGET_WEBSITE_HOMEPAGE_LINK (make sure to use the format http(s)://abc.xyz).
- To change the number of workers (threads):
  NUMBER_OF_THREADS = 8
  Change this variable from 8 to YOUR_THREADS (the default is 8).
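For reference, here is a minimal sketch of what the top of main.py might look like after these edits; the three variable names come from the steps above, while the example values are placeholders rather than the project's defaults.

```python
# main.py -- configuration section (illustrative values only)
PROJECT_NAME = 'my_crawl'              # was 'TEST'
HOMEPAGE = 'https://www.example.org/'  # was 'https://example.com/'; use the http(s)://abc.xyz form
NUMBER_OF_THREADS = 8                  # default is 8; raise for more parallel workers
```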
Step 2: Run the crawler
- Using Python 2:
python main.py
- Using Python 3:
python3 main.py
1. general.py
This file contains various supporting functions that the website crawler relies on (see the sketch below).
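As an illustration of what such helpers look like, the sketch below covers project-folder setup and reading/writing URL sets with the os module; the function names here are assumptions, not necessarily the ones general.py defines.

```python
import os

# Illustrative helpers only -- the real general.py may define different ones.

def create_project_dir(directory):
    # Create the project's data folder if it does not exist yet.
    if not os.path.exists(directory):
        os.makedirs(directory)

def file_to_set(file_name):
    # Read a file of URLs (one per line) into a set.
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.strip())
    return results

def set_to_file(links, file_name):
    # Write a set of URLs back to disk, one per line.
    with open(file_name, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')
```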
2. link_finder.py
This file contains parsing functions, built on the 'HTMLParser' and 'urllib' Python modules, that the crawler uses to extract hyperlinks from downloaded pages (sketched below).
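A link finder of this kind is typically an HTMLParser subclass that records every href it encounters. The sketch below assumes Python 3 module paths (html.parser, urllib.parse) and illustrative class and method names:

```python
from html.parser import HTMLParser  # the module is named HTMLParser in Python 2
from urllib.parse import urljoin    # lives in the urlparse module in Python 2

class LinkFinder(HTMLParser):
    """Collects the hyperlinks found on a single page (illustrative sketch)."""

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; each <a href="..."> contributes a link,
        # resolved against the base URL so relative paths become absolute.
        if tag == 'a':
            for attribute, value in attrs:
                if attribute == 'href' and value:
                    self.links.add(urljoin(self.base_url, value))

    def page_links(self):
        return self.links
```

Calling finder.feed(html) on a page's markup populates the set returned by finder.page_links().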
3. domain.py
This file acts as a helper, providing functions that extract the domain name and sub-domain name from a URL.
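One common way to implement these helpers is with urllib's URL-splitting utilities; the sketch below uses a naive last-two-labels heuristic for the registered domain and is an assumption about the approach, not the file's exact code.

```python
from urllib.parse import urlparse  # the urlparse module in Python 2

def get_sub_domain_name(url):
    # Full host part, e.g. 'blog.example.com' for 'https://blog.example.com/post'.
    try:
        return urlparse(url).netloc
    except ValueError:
        return ''

def get_domain_name(url):
    # Registered domain, e.g. 'example.com'. Naive: keeps the last two labels,
    # so suffixes like '.co.uk' are not handled correctly.
    parts = get_sub_domain_name(url).split('.')
    return '.'.join(parts[-2:]) if len(parts) >= 2 else ''
```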
4. spider.py
The actual Spider class, which performs all of the webpage-crawling work.
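The heart of such a class is a crawl step that downloads a page, extracts its links, and moves URLs between a "queued" set and a "crawled" set. The sketch below assumes the LinkFinder sketched earlier and illustrative method names; the real Spider likely also persists these sets to disk via the general.py helpers.

```python
from urllib.request import urlopen

class Spider:
    queue = set()    # URLs discovered but not yet visited
    crawled = set()  # URLs already visited

    def __init__(self, homepage):
        Spider.queue.add(homepage)

    @staticmethod
    def gather_links(page_url):
        # Download the page and return the hyperlinks found on it.
        try:
            with urlopen(page_url) as response:
                html = response.read().decode('utf-8', errors='ignore')
        except Exception:
            return set()
        finder = LinkFinder(page_url, page_url)  # from the link_finder sketch
        finder.feed(html)
        return finder.page_links()

    @staticmethod
    def crawl_page(page_url):
        # Visit one URL: queue any new links, then mark the URL as crawled.
        if page_url in Spider.crawled:
            return
        for url in Spider.gather_links(page_url):
            if url not in Spider.crawled:
                Spider.queue.add(url)
        Spider.queue.discard(page_url)
        Spider.crawled.add(page_url)
```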
5. main.py
This is the main file and the entry point of the project's execution: it creates multiple worker threads and assigns crawl jobs to them.
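The threading pattern such a file typically uses is a shared job queue with daemon workers. The sketch below combines the Step 1 variables with Python's threading and queue modules and assumes the Spider sketched above; it is illustrative (and skips locking around the shared sets), not main.py's exact contents.

```python
import threading
from queue import Queue  # the module is named Queue in Python 2

NUMBER_OF_THREADS = 8
HOMEPAGE = 'https://example.com/'

job_queue = Queue()

def work():
    # Each worker repeatedly pulls a URL off the queue and crawls it.
    while True:
        url = job_queue.get()
        Spider.crawl_page(url)   # assumes the Spider sketched above
        job_queue.task_done()

def create_workers():
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        t.daemon = True          # daemon threads exit with the main thread
        t.start()

if __name__ == '__main__':
    Spider(HOMEPAGE)
    create_workers()
    # Keep handing batches of queued URLs to the workers until no new
    # links are discovered.
    while Spider.queue:
        for url in list(Spider.queue):
            job_queue.put(url)
        job_queue.join()
```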