
Module 'scrapy_selenium' doesn't define any object named 'SeleniumDownloadHandler' #3

Open
tumregels opened this issue Mar 15, 2020 · 8 comments


@tumregels

In the docs you mention:

# You need also to change the default download handlers, like so:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_selenium.SeleniumDownloadHandler",
    "https": "scrapy_selenium.SeleniumDownloadHandler",
}

but the plugin does not define any such SeleniumDownloadHandler.

@tumregels

tumregels commented Mar 16, 2020

You probably forgot to update the docs. I also tried:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_headless.HeadlessDownloadHandler",
    "https": "scrapy_headless.HeadlessDownloadHandler",
}

but it still fails:

Traceback (most recent call last):
  File "venv/scrapex/src/scrapy-selenium/scrapy_headless/downloader.py", line 82, in get_driver
    driver = self._data.driver
AttributeError: '_thread._local' object has no attribute 'driver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "venv/scrapex/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "venv/scrapex/lib/python3.6/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "venv/scrapex/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "venv/scrapex/lib/python3.6/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "venv/scrapex/lib/python3.6/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "venv/scrapex/lib/python3.6/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "venv/scrapex/lib/python3.6/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "venv/scrapex/src/scrapy-selenium/scrapy_headless/downloader.py", line 65, in process_request
    driver = self.get_driver(spider)
  File "venv/scrapex/src/scrapy-selenium/scrapy_headless/downloader.py", line 85, in get_driver
    command_executor=self.grid_url, desired_capabilities=self.capabilities
  File "venv/scrapex/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "venv/scrapex/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "venv/scrapex/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "venv/scrapex/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 208, in check_response
    raise exception_class(value)
selenium.common.exceptions.WebDriverException: Message: <!DOCTYPE html>
...
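
For context, the AttributeError at the top of the traceback is expected behaviour: the handler keeps one WebDriver per worker thread in a threading.local() and lazily creates it on first access, so the real failure is the WebDriverException from the grid further down. A minimal sketch of that lazy per-thread pattern, with illustrative names (not the plugin's actual API) and a plain object standing in for a real WebDriver:

```python
import threading


class DriverPool:
    """Sketch of a lazy per-thread driver cache; names are illustrative."""

    def __init__(self):
        self._data = threading.local()  # separate attribute storage per thread

    def get_driver(self):
        try:
            # First access in a given thread raises AttributeError,
            # exactly as seen at the top of the traceback above.
            return self._data.driver
        except AttributeError:
            # Stand-in for creating a remote WebDriver against the grid.
            self._data.driver = object()
            return self._data.driver


pool = DriverPool()
first = pool.get_driver()
# Subsequent calls in the same thread reuse the cached driver.
assert pool.get_driver() is first
```

In the real handler, the except branch is where the remote WebDriver session is opened against SELENIUM_GRID_URL, which is why a misconfigured grid surfaces as a WebDriverException during that exception handling.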

@Mikhail010

@tumregels Did you find a solution for this?

@tumregels

@Mikhail010 I tried my best but failed.

@CatSirSir

Excuse me, I don't want to use Splash, and I found this plugin. Does it work well? I see the latest commit is one year ago, so I'm a little afraid to use it.

@Mikhail010

@tumregels I got it working with your suggestion, setting everything inside the spider class as custom settings:

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SELENIUM_GRID_URL': 'http://127.0.0.1:4444/wd/hub',  # Example for local grid with docker-compose
        'SELENIUM_NODES': 1,  # Number of nodes (browsers) you are running on your grid
        'SELENIUM_CAPABILITIES': {
            "browserName": "chrome",
            "version": "",
            "platform": "ANY",
            "acceptInsecureCerts": True
        },
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_headless.HeadlessDownloadHandler",
            "https": "scrapy_headless.HeadlessDownloadHandler",
        },
        'SELENIUM_PROXY': 'http://docker.for.mac.host.internal:24000'
    }

...

@Mikhail010

@CatSirSir I struggled to get it working, but since then it has been working fine. Take into account that I only started using it last week, so I cannot give you a broad opinion. If I were you I would give it a try. I also use Splash, but it gives me trouble with Angular sites and with sites that can detect headless browsers.

@BruceDone

BruceDone commented Apr 28, 2021

@Mikhail010 Hi, I fixed it with PR #6.

@kuzovkov

kuzovkov commented Oct 13, 2021

Thanks @Mikhail010

I got it working with these configs.

docker-compose.yml:

services:
  selenium-hub:
    image: selenium/hub
    networks:
      - back
    ports:
      - 4444:4444

  chrome:
    image: selenium/node-chrome
    links:
      - selenium-hub:hub
    environment:
      - HUB_PORT_4444_TCP_ADDR=selenium-hub:4444/grid/register/
      - GRID_TIMEOUT=180 # Default timeout is 30s, which might be low for Selenium
    volumes:
      - /dev/shm:/dev/shm
    networks:
      - back


networks:
  back:
    driver: bridge

In spider:

import scrapy

from scrapy_headless import HeadlessRequest
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'SELENIUM_GRID_URL': 'http://selenium-hub:4444/wd/hub',  # Example for local grid with docker-compose
        'SELENIUM_NODES': 1,  # Number of nodes(browsers) you are running on your grid
        'SELENIUM_CAPABILITIES': DesiredCapabilities.CHROME,
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_headless.HeadlessDownloadHandler",
            "https": "scrapy_headless.HeadlessDownloadHandler",
        }
    }
