Tarantula - A Python-based website crawler!

This is a project demonstrating the use of standard Python libraries such as os, urllib, and HTMLParser to build a minimalist web crawler that visits the pages of a website and gathers their hyperlinks (URLs).

Files in the project:

  • general.py
  • link_finder.py
  • domain.py
  • spider.py
  • main.py

How to run:

Step 1: Make the necessary changes
Open main.py and edit the following lines (shown together in the sketch after this list):

  • To name your project:
    PROJECT_NAME = 'TEST'
    Replace TEST with your own project name.
  • To set the homepage URL:
    HOMEPAGE = 'https://example.com/'
    Replace https://example.com/ with the target website's homepage (use the full scheme, i.e. http(s)://abc.xyz).
  • To change the number of workers (threads):
    NUMBER_OF_THREADS = 8
    Replace 8 with your desired thread count (the default is 8).
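For reference, after your edits the configuration block at the top of main.py should look roughly like this (the values below are the defaults/placeholders from the steps above):

```python
# Configuration values at the top of main.py (placeholders shown):
PROJECT_NAME = 'TEST'                # your project name
HOMEPAGE = 'https://example.com/'    # the target website's homepage, with scheme
NUMBER_OF_THREADS = 8                # number of crawler worker threads
```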

Step 2: Run the crawler

  • Using Python 2:
    python main.py
  • Using Python 3:
    python3 main.py

File Descriptions:

1. general.py

This file contains various supporting functions used throughout the website crawler.
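As an illustration, such helpers are typically thin wrappers around the os module and plain file I/O. The sketch below is a hypothetical example of that style; the function names are not necessarily the project's own:

```python
import os

def create_project_dir(directory):
    # Create the folder that holds this project's link files.
    if not os.path.exists(directory):
        os.makedirs(directory)

def file_to_set(file_name):
    # Read a file of URLs (one per line) into a set.
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.rstrip('\n'))
    return results

def set_to_file(links, file_name):
    # Write a set of URLs back to disk, one per line.
    with open(file_name, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')
```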

2. link_finder.py

This file contains the parsing functions, built on the 'HTMLParser' and 'urllib' Python modules, that the crawler uses to extract links from pages.
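A minimal sketch of such a link-collecting parser follows, using Python 3 module paths (the class name and constructor arguments are illustrative, not necessarily the project's own):

```python
from html.parser import HTMLParser   # the 'HTMLParser' module in Python 2
from urllib.parse import urljoin     # 'urlparse.urljoin' in Python 2

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag,
        # resolving relative links against the base URL.
        if tag == 'a':
            for attribute, value in attrs:
                if attribute == 'href':
                    self.links.add(urljoin(self.base_url, value))
```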

3. domain.py

This file is a helper that provides functions to extract the domain name and sub-domain name from a URL.
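A minimal sketch of that kind of helper, assuming the standard urlparse approach (function names are illustrative):

```python
from urllib.parse import urlparse   # the 'urlparse' module in Python 2

def get_sub_domain_name(url):
    # Full network location, e.g. 'docs.example.com'
    try:
        return urlparse(url).netloc
    except Exception:
        return ''

def get_domain_name(url):
    # Reduce 'docs.example.com' to 'example.com'
    parts = get_sub_domain_name(url).split('.')
    return '.'.join(parts[-2:]) if len(parts) >= 2 else parts[0]
```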

4. spider.py

This file defines the Spider class, which performs all of the actual webpage crawling.
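Conceptually, one crawl step looks like the sketch below, which reuses the LinkFinder sketch from above (the function name and the set-based bookkeeping are assumptions, not the project's exact code):

```python
from urllib.request import urlopen   # 'urllib2.urlopen' in Python 2

def crawl_page(url, queue, crawled):
    # One crawl step: fetch the page, extract its links, and queue
    # every link that has not been seen before.
    if url in crawled:
        return
    html = urlopen(url).read().decode('utf-8', errors='ignore')
    finder = LinkFinder(url, url)     # LinkFinder as sketched above
    finder.feed(html)
    for link in finder.links:
        if link not in queue and link not in crawled:
            queue.add(link)
    queue.discard(url)
    crawled.add(url)
```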

5. main.py

This is the main file and the starting point of the project's execution: it creates multiple worker threads and assigns crawl jobs to them.
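A rough sketch of that thread setup, assuming queue-based job dispatch and the crawl_page function sketched under spider.py (names here are illustrative):

```python
import threading
from queue import Queue   # the 'Queue' module in Python 2

NUMBER_OF_THREADS = 8
job_queue = Queue()
waiting, crawled = set(), set()   # shared bookkeeping sets

def work():
    # Each worker loops forever, pulling one URL at a time off the
    # job queue and crawling it (crawl_page as sketched under spider.py).
    while True:
        url = job_queue.get()
        crawl_page(url, waiting, crawled)
        job_queue.task_done()

def create_workers():
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        t.daemon = True   # workers exit when the main thread exits
        t.start()
```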