
Error: cannot access local variable 'browser' where it is not associated with a value #795

Open
aleenprd opened this issue Nov 11, 2024 · 4 comments

Comments

@aleenprd

Describe the bug
My code works locally, both on the bare machine and in a Docker container, but when it runs in Kubernetes I get some very strange errors along the lines of "cannot access local variable 'browser' where it is not associated with a value".

To Reproduce
Steps to reproduce the behavior:

  • Run a simple scrape of a page in a Kubernetes pod.

Expected behavior
The scraper should behave in Kubernetes exactly as it does locally on the bare machine and in a Docker container.

Desktop (please complete the following information):

  • OS: Ubuntu. This is a Docker container running in a Kubernetes pod.
  • Browser: headless browser (Playwright-managed).

Additional context
Here are logs:

[2024-11-11, 15:30:56 UTC] {pod_manager.py:418} INFO - [base] Attempt 1 failed:
[2024-11-11, 15:30:56 UTC] {pod_manager.py:418} INFO - [base] 2024-11-11 15:30:56,719 - ERROR - Error scraping swim-spa-abdeckung.de. Error: cannot access local variable 'browser' where it is not associated with a value
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base] Future exception was never retrieved
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base] future: <Future finished exception=TargetClosedError('Target page, context or browser has been closed')>
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base] Traceback (most recent call last):
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 518, in wrap_api_call
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]     return await cb()
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]            ^^^^^^^^^^
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 85, in inner_send
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]     callback = self._connection._send_message_to_server(
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 322, in _send_message_to_server
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base]     raise self._closed_error
[2024-11-11, 15:31:05 UTC] {pod_manager.py:418} INFO - [base] playwright._impl._errors.TargetClosedError: Target page, context or browser has been closed
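For context, this exact message is usually produced by a cleanup block that references a variable whose assignment failed. A minimal sketch of the pattern, with a hypothetical `launch` helper standing in for a browser launch that fails only inside the pod (e.g. missing system libraries for headless Chromium):

```python
def launch():
    # Stand-in for a browser launch that fails inside the pod.
    raise RuntimeError("chromium failed to start")

def scrape():
    try:
        browser = launch()   # raises before 'browser' is ever bound
        return browser
    finally:
        browser.close()      # UnboundLocalError masks the real launch failure

caught = None
try:
    scrape()
except UnboundLocalError as e:
    caught = e

print(type(caught).__name__)  # UnboundLocalError
```

If this is what is happening, the UnboundLocalError raised in the cleanup path hides the real launch error, which would explain why the message only appears in the Kubernetes environment.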
@aleenprd
Author

I noticed this becomes an issue when using ThreadPoolExecutor. Apparently, the error comes from Playwright's use of asyncio. I still don't understand why it works locally but not in K8s.

[2024-11-11, 17:46:09 UTC] {pod_manager.py:418} INFO - [base] 2024-11-11 17:46:09,794 - INFO - Testing AI scraper with multiple domains and threading...
[2024-11-11, 17:46:09 UTC] {pod_manager.py:418} INFO - [base] 2024-11-11 17:46:09,794 - INFO - Batch 1: len 2, indexes 0:2...
[2024-11-11, 17:46:21 UTC] {pod_manager.py:418} INFO - [base] Attempt 1 failed:
[2024-11-11, 17:46:33 UTC] {pod_manager.py:418} INFO - [base] 2024-11-11 17:46:33,832 - WARNING - Setting up local directories...
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base] Future exception was never retrieved
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base] future: <Future finished exception=TargetClosedError('Target page, context or browser has been closed')>
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base] Traceback (most recent call last):
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 518, in wrap_api_call
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]     return await cb()
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]            ^^^^^^^^^^
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 85, in inner_send
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]     callback = self._connection._send_message_to_server(
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 322, in _send_message_to_server
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base]     raise self._closed_error
[2024-11-11, 17:46:36 UTC] {pod_manager.py:418} INFO - [base] playwright._impl._errors.TargetClosedError: Target page, context or browser has been closed
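Since the library drives Playwright through asyncio, one thing worth checking is whether each worker thread gets its own event loop rather than touching a loop created in another thread. A minimal sketch of that pattern, with a hypothetical `fake_scrape` coroutine standing in for the real scrape call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def fake_scrape(domain: str) -> str:
    # Stand-in for an async, Playwright-backed scrape.
    await asyncio.sleep(0.01)
    return f"scraped {domain}"

def run_in_fresh_loop(domain: str) -> str:
    # asyncio.run() creates a new event loop in the calling thread and
    # closes it on exit, so no loop is ever shared between threads.
    return asyncio.run(fake_scrape(domain))

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_in_fresh_loop, ["a.example", "b.example"]))

print(results)  # ['scraped a.example', 'scraped b.example']
```

If the scraper internally reuses one loop across threads, this isolation may not be possible from the outside, but it is a cheap experiment to rule out.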

@f-aguzzi
Member

There seems to be a recurring issue with nested asyncio calls. It usually pops up in Jupyter Notebook environments, but it may be the same with ThreadPoolExecutor. Try installing the nest_asyncio package and putting this at the beginning of your script:

import nest_asyncio
nest_asyncio.apply()

@aleenprd
Author

> There seems to be a recurring issue with nested asyncio calls. It usually pops up in Jupyter Notebook environments, but it may be the same with ThreadPoolExecutor. Try installing the nest_asyncio package and putting this at the beginning of your script:
>
> import nest_asyncio
> nest_asyncio.apply()

That didn't work for me, and it doesn't make sense why it would work on my machine and in Docker but not in K8s.

The code:

def execute_concurrently(
    logger,
    s3_handler,
    func: callable,
    kwargs: dict,
    directory: str,
    bucket_name: str,
    objects: list[dict],
    batch_size: int,
    clean_local_storage: bool = False,
    sleep_time: int = 3,
):
    nest_asyncio.apply()
    
    if args.mode in ["ping_http_status", "ai_scrape", "test_ai_scrape"]:
        obj_key = "domain"
    else:
        obj_key = "obj"

    timestamp_format = "%Y_%m_%d_T%H_%M_%SZ"
    len_objects = len(objects)
    batch_no = 1
    runtimes = []

    # We will process the objects in batches and within each batch, we will process them concurrently
    for i in range(0, len_objects, batch_size):
        sleep_time = 0 if batch_no == 1 else sleep_time
        sleep(sleep_time)

        t0 = time()

        start = i
        end = min(i + batch_size, len_objects)
        batch = objects[start:end]
        logger.info(f"Batch {batch_no}: len {len(batch)}, indexes {start}:{end}...")

        kwargs_list_batch = [kwargs | {obj_key: obj} for obj in batch]
        logger.debug(f"Batch {batch_no} kwargs: {kwargs_list_batch}")

        batch_results = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=None) as executor:
            futures = get_futures(executor, func, kwargs_list_batch)

            for future in concurrent.futures.as_completed(futures):
                batch_results.append(future.result())

        # Determine the filepath to save the batch results and save them locally
        current_timestamp_utc = Chronos.get_current_datetime_utc(
            format=timestamp_format
        )
        filename = f"{args.mode}_{args.stage}_worker_{args.worker_number}_batch_{batch_no}_{current_timestamp_utc}.json"
        filepath = f"{directory}/{filename}"
        save_results_json_newline(data=batch_results, path=filepath, el_timestamp=None)

        # After this, upload the batch results to S3 then clear the local storage (optional)
        s3_handler.upload_file(
            file_name=filepath,
            bucket_name=bucket_name,
            object_name=filepath,
        )
        if clean_local_storage:
            os.remove(filepath)

        batch_no += 1

        t1 = time()
        runtime = round(t1 - t0, 2)
        runtimes.append(runtime)
        if batch_no % 10 == 0:
            logger.info(
                f"Avg batch runtime: {round(sum(runtimes) / len(runtimes), 2)}s"
            )
called like:
        logger.info("Testing AI scraper with multiple domains and threading...")
        domains = ["boligportal.dk", "totteland.dk"]
        execute_concurrently(
            logger=logger,
            s3_handler=s3_handler,
            func=domain_scraper.ss.scrape,
            kwargs={"prompt_id": "test", "prompt": "Provide a summary of the page."},
            directory=_dir,
            bucket_name=bucket_name,
            objects=domains,
            batch_size=2,
            clean_local_storage=args.clean_local_storage_middle,
        )

where the callable is:

class SmartScraperWrapper:
    """A wrapper class for the SmartScraperGraph class from scrapegraphai."""

    def __init__(
        self,
        logger: Toth,
        api_key: str = os.getenv("OPENAI_API_KEY"),
        model: str = os.getenv("OPENAI_SCRAPER_MODEL", "openai/gpt-4o-mini"),
        verbose: bool = False,
        headless: bool = True,
    ):
        self.logger = logger
        self.api_key = api_key
        self.model = model
        self.graph_config = {
            "llm": {
                "api_key": self.api_key,
                "model": self.model,
            },
            "verbose": verbose,
            "headless": headless,
            "temperature": 0.0,
        }

    def scrape(self, domain: str, prompt_id: str, prompt: str = None) -> Dict[str, str]:
        """Scrape a source using the SmartScraperGraph class.

        Args:
        - domain (str): a domain string to be scraped.
        - prompt_id (str): a prompt ID.
        - prompt (str): a prompt to be used in the scraping process.

        Returns:
        Dict[str, str]: a dictionary with the scraped data plus metadata.
        """
        source = domain_to_url(domain) 
        scraper = SmartScraperGraph(
            prompt=prompt, source=source, config=self.graph_config
        )
        current_timestamp_utc = Chronos.get_current_datetime_utc()
        try:
            result = scraper.run()
        except Exception as e:
            self.logger.error(f"Error scraping {domain}. Error: {e}")
            return None
        
        result = replace_na_with_none(result)
        result["DOMAIN"] = domain
        result["TIMESTAMP_UTC"] = current_timestamp_utc
        result["ID"] = hashlib.sha256(
            f"{domain}-{current_timestamp_utc}".encode()
        ).hexdigest()
        result["MODEL"] = self.model
        result["PROMPT_ID"] = prompt_id

        return result

resulting in the following issue:

[2024-11-13, 09:52:52 UTC] {pod_manager.py:418} INFO - [base] 2024-11-13 09:52:52,136 - DEBUG - Batch 1 kwargs: [{'prompt_id': 'test', 'prompt': 'Provide a summary of the page.', 'domain': 'boligportal.dk'}, {'prompt_id': 'test', 'prompt': 'Provide a summary of the page.', 'domain': 'totteland.dk'}]
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base] Attempt 1 failed:
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base] Attempt 1 failed:
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base] Future exception was never retrieved
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base] future: <Future finished exception=TargetClosedError('Target page, context or browser has been closed')>
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base] Traceback (most recent call last):
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 518, in wrap_api_call
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]     return await cb()
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]            ^^^^^^^^^^
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 85, in inner_send
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]     callback = self._connection._send_message_to_server(
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 322, in _send_message_to_server
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base]     raise self._closed_error
[2024-11-13, 09:53:03 UTC] {pod_manager.py:418} INFO - [base] playwright._impl._errors.TargetClosedError: Target page, context or browser has been closed
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base] Future exception was never retrieved
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base] future: <Future finished exception=TargetClosedError('Target page, context or browser has been closed')>
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base] Traceback (most recent call last):
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 518, in wrap_api_call
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]     return await cb()
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]            ^^^^^^^^^^
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 85, in inner_send
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]     callback = self._connection._send_message_to_server(
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 322, in _send_message_to_server
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base]     raise self._closed_error
[2024-11-13, 09:53:05 UTC] {pod_manager.py:418} INFO - [base] playwright._impl._errors.TargetClosedError: Target page, context or browser has been closed
[2024-11-13, 09:53:15 UTC] {pod_manager.py:418} INFO - [base] Attempt 2 failed:
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base] Future exception was never retrieved
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base] future: <Future finished exception=TargetClosedError('Target page, context or browser has been closed')>
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base] Traceback (most recent call last):
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 518, in wrap_api_call
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]     return await cb()
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]            ^^^^^^^^^^
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 85, in inner_send
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]     callback = self._connection._send_message_to_server(
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 322, in _send_message_to_server
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base]     raise self._closed_error
[2024-11-13, 09:53:16 UTC] {pod_manager.py:418} INFO - [base] playwright._impl._errors.TargetClosedError: Target page, context or browser has been closed

@aleenprd
Author

aleenprd commented Nov 13, 2024

@f-aguzzi I'm not attached to doing it this way. I just want to run the scraper as fast as possible, using whatever supported and stable method there is. If you could guide me towards an approach that ensures high concurrency, I would be very thankful.

Edit: actually, if I now add that nested-async line to my code, it breaks locally the same way it does in my pod. Weird.
