```bash
pip install proxy-crawler/
```
Usage
```python
import requests

from scrapeproxy import proxies

# example of filters:
# proxy = proxies.get_proxies(limit=1, filter_sources=["best", "hmn", "pub"])
proxy = proxies.get_proxies()

url = "https://example.com"  # any target URL; not part of the library
requests.get(url, proxies=proxy[0]["proxy"])
...
```
Example proxy output (the `info` fields differ for each source):
```python
[{'info': {'IP Address': '35.221.107.127',
           'Port': '3128',
           'Code': 'US',
           'Country': 'United States',
           'Anonymity': 'anonymous',
           'Google': 'no',
           'Https': 'yes',
           'Last Checked': '1 minute ago'},
  'proxy': {'http': '35.221.107.127:3128', 'https': '35.221.107.127:3128'},
  'proxy_string': '35.221.107.127:3128',
  'source': 'https://www.us-proxy.org/'}]
```
NOTE: by default the module keeps a session-wide cache (so a new import starts a new cache) and does not return the same proxy more than once. You can clear the cache with the `empty_cache` param; for more details type `proxies.get_proxies?`.
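As a minimal sketch, assuming `empty_cache` is a boolean keyword argument of `get_proxies` (the parameter name comes from the note above; the exact call signature may differ):

```python
from scrapeproxy import proxies

# first call fills the session-wide cache with the proxies it returns
first_batch = proxies.get_proxies()

# assumption: empty_cache=True clears that cache, so proxies returned
# earlier in the session may be returned again
fresh_batch = proxies.get_proxies(empty_cache=True)
```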
NOTE: we try to crawl a large number of proxies, so their quality may not be the best. We suggest setting short timeouts (e.g. 3 seconds) and using them to discard proxies that are too slow or unresponsive.
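A minimal sketch of that filtering, assuming the output format shown above (`TEST_URL` and the 3-second timeout are illustrative choices, not part of the library):

```python
import requests

from scrapeproxy import proxies

TEST_URL = "https://example.com"  # illustrative test target

working = []
for candidate in proxies.get_proxies():
    try:
        # short timeout so slow or unresponsive proxies fail fast
        requests.get(TEST_URL, proxies=candidate["proxy"], timeout=3)
    except requests.RequestException:
        continue  # too slow or unresponsive; ignore this proxy
    working.append(candidate)
```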
Check this file
Add more sources for proxies.
- Add a file for your source in the handlers folder.
- Write a class with a `get_proxy_list` method, taking params `(limit=-1, anonymous=True, https=True, google=False)`, which returns a list of dicts in the following format (a sketch of such a handler appears after this list):
```
{
    "info": dict with info about the proxy,
    "proxy": dict with "http" and "https" keys,
    "proxy_string": string in host:port format,
    "source": source of the proxy,
}
```
- Go to `proxy_sources.py` and add your source, in the same format, to the `proxy_sources` list. [NOTE] order is very important: it denotes the importance and quality of the source.
- Nothing else, thanks ;).
- One more step: please write docstrings and comment your code.
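As a sketch of what such a handler might look like (the class name, source URL, and parsing logic are hypothetical; only the `get_proxy_list` signature and return format come from the steps above):

```python
# handlers/example_source.py -- hypothetical handler; name and logic are illustrative
class ExampleSourceHandler:
    """Crawls proxies from a hypothetical source and returns them in the shared format."""

    SOURCE_URL = "https://example.com/proxy-list"  # placeholder source

    def get_proxy_list(self, limit=-1, anonymous=True, https=True, google=False):
        """Return a list of proxy dicts; limit=-1 means no limit."""
        # A real handler would fetch and parse SOURCE_URL here, applying the
        # anonymous/https/google filters; this sketch returns one hard-coded entry.
        proxy_string = "35.221.107.127:3128"
        results = [{
            "info": {"Anonymity": "anonymous", "Https": "yes"},  # fields vary per source
            "proxy": {"http": proxy_string, "https": proxy_string},
            "proxy_string": proxy_string,
            "source": self.SOURCE_URL,
        }]
        return results if limit == -1 else results[:limit]
```

The handler would then be added to the `proxy_sources` list in `proxy_sources.py`, keeping the quality ordering in mind.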
- We have decided to try a proxy pool instead of a module (to overcome the limits more easily).
- Add more sources and possibly verify them.
- Test and compare the proxy-pool vs. proxy-module approaches.