
Chunking support for XMLScraperGraph #800

Open
Etherealspringfall opened this issue Nov 15, 2024 · 6 comments
@Etherealspringfall
Hi, I'm hitting a length limit when using a third-party model to extract data from local HTML. Could chunking support be added to XMLScraperGraph?

code:

import logging
import os

from langchain_openai import ChatOpenAI
from scrapegraphai.graphs import XMLScraperGraph
from scrapegraphai.utils import prettify_exec_info

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


instance_config = {
    "model": "qwen-turbo",
    "openai_api_base": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "api_key": ''
}


llm_model_instance = ChatOpenAI(**instance_config)

graph_config = {
    "llm": {
        "model_instance": llm_model_instance,
        "model_tokens": 50000
    },
    "verbose": True
}


def run_scraper():
    try:
        FILE_NAME = "./page.html"
        curr_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(curr_dir, FILE_NAME)


        with open(file_path, 'r', encoding="utf-8") as file:
            text = file.read()

        smart_scraper_graph = XMLScraperGraph(
            prompt="""
            Please do the following:
            1. Parse HTML content
            2. Extract the URLs of all package-related images
            3. The return format is: [{"src":"URL of the picture"}]
            """,
            source=text,
            config=graph_config
        )

        result = smart_scraper_graph.run()
        logger.info("URL:%s", result)
        graph_exec_info = smart_scraper_graph.get_execution_info()
        logger.info(prettify_exec_info(graph_exec_info))

    except Exception as e:
        logger.error("An error occurred during fetching: %s", e)
        raise e



if __name__ == "__main__":
    run_scraper()

error:

openai.BadRequestError: Error code: 400 - {'error': {'code': 'invalid_parameter_error', 'param': None, 'message': '<400> InternalError.Algo.InvalidParameter: Range of input length should be [1, 129024]', 'type': 'invalid_request_error'}, 'id': 'chatcmpl-d23aff7b-baf1-9da3-814d-52d9deb520c8', 'request_id': 'd23aff7b-baf1-9da3-814d-52d9deb520c8'}
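Until native chunking lands in XMLScraperGraph, one workaround is to split the source yourself and run the graph once per chunk, merging the per-chunk results afterwards. Below is a minimal, self-contained sketch of such a splitter. It is not part of the scrapegraphai API: the ~4-characters-per-token ratio and the overlap size are rough assumptions, and an exact count would need a real tokenizer such as tiktoken.

```python
def chunk_text(text: str, max_tokens: int = 50000,
               chars_per_token: int = 4, overlap: int = 200) -> list[str]:
    """Split text into chunks that stay under a rough token budget.

    Assumes ~4 characters per token as a crude heuristic; use a real
    tokenizer for accurate counts. Adjacent chunks share `overlap`
    characters so a tag split at a boundary is seen in both chunks.
    """
    max_chars = max_tokens * chars_per_token
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap window
    return chunks
```

Each chunk could then be passed as `source=` to a fresh XMLScraperGraph instance, with the extracted image lists concatenated and deduplicated at the end.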
@madguy02
@VinciGit00 is this something we are looking at, to support?

@VinciGit00
Collaborator

Yes please, @madguy02, if you can help us we would be glad.

@madguy02

Can you assign it to me, @VinciGit00?

@VinciGit00
Collaborator

Hi @madguy02, I assigned it.

@madguy02

madguy02 commented Nov 19, 2024

So, can you tell me more about the model, @Etherealspringfall? How many tokens does it support? For GPT-3.5 or GPT-4, 50000 is well within limits, and the error says the input range is [1, 129024]. I tried the code out with GPT-4 mini and it does not give me the error above, so I think it comes down to the token limits of the model you're using.

Moreover, AFAIK XMLScraperGraph is meant to scrape .xml files; I'm not sure .html input is supported in this case(?)

@Etherealspringfall can you run this code and give me the number of tokens in the text you are sending:

import tiktoken

MODEL = "gpt-4"
MAX_TOKENS = 50000

# Note: tiktoken.get_encoding() expects an encoding name like
# "cl100k_base"; encoding_for_model() resolves the right encoding
# for a model name such as "gpt-4".
encoding = tiktoken.encoding_for_model(MODEL)

def count_tokens(text: str) -> int:
    tokens = encoding.encode(text)
    return len(tokens)

# e.g. print(count_tokens(text)) with your HTML, then compare to MAX_TOKENS

@VinciGit00
Collaborator

That graph is deprecated.
