
Chunking support for XMLScraperGraph #800

Open
Etherealspringfall opened this issue Nov 15, 2024 · 6 comments
@Etherealspringfall
Hi, I'm hitting a length limit when using a third-party model to extract data from local HTML. Could chunking support be added to XMLScraperGraph?

code:

import logging
import os

from langchain_openai import ChatOpenAI
from scrapegraphai.graphs import XMLScraperGraph
from scrapegraphai.utils import prettify_exec_info

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


instance_config = {
    "model": "qwen-turbo",
    "openai_api_base": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "api_key": ''
}


llm_model_instance = ChatOpenAI(**instance_config)

graph_config = {
    "llm": {
        "model_instance": llm_model_instance,
        "model_tokens": 50000
    },
    "verbose": True
}


def run_scraper():
    try:
        FILE_NAME = "./page.html"
        curr_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(curr_dir, FILE_NAME)


        with open(file_path, 'r', encoding="utf-8") as file:
            text = file.read()

        smart_scraper_graph = XMLScraperGraph(
            prompt="""
            Please do the following:
            1. Parse HTML content
            2. Extract the URLs of all package-related images
            3. The return format is: [{"src":"URL of the picture"}]
            """,
            source=text,
            config=graph_config
        )

        result = smart_scraper_graph.run()
        logger.info("URL:%s", result)
        graph_exec_info = smart_scraper_graph.get_execution_info()
        logger.info(prettify_exec_info(graph_exec_info))

    except Exception as e:
        logger.error("An error occurred during fetching: %s", e)
        raise e



if __name__ == "__main__":
    run_scraper()

error:

openai.BadRequestError: Error code: 400 - {'error': {'code': 'invalid_parameter_error', 'param': None, 'message': '<400> InternalError.Algo.InvalidParameter: Range of input length should be [1, 129024]', 'type': 'invalid_request_error'}, 'id': 'chatcmpl-d23aff7b-baf1-9da3-814d-52d9deb520c8', 'request_id': 'd23aff7b-baf1-9da3-814d-52d9deb520c8'}
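Until native chunking lands in XMLScraperGraph, one workaround is to split the source yourself and run the graph once per chunk, merging the per-chunk results afterwards. Below is a minimal, self-contained sketch of such a splitter. It is not part of the scrapegraphai API: the ~4-characters-per-token ratio and the overlap size are rough assumptions, and an exact count would need a real tokenizer such as tiktoken.

```python
def chunk_text(text: str, max_tokens: int = 50000,
               chars_per_token: int = 4, overlap: int = 200) -> list[str]:
    """Split text into chunks that stay under a rough token budget.

    Assumes ~4 characters per token as a crude heuristic; use a real
    tokenizer for accurate counts. Adjacent chunks share `overlap`
    characters so a tag split at a boundary is seen in both chunks.
    """
    max_chars = max_tokens * chars_per_token
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap window
    return chunks
```

Each chunk could then be passed as `source=` to a fresh XMLScraperGraph instance, with the extracted image lists concatenated and deduplicated at the end.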
@madguy02
@VinciGit00 is this something we are looking at, to support?

@VinciGit00
Collaborator

Yes please, @madguy02, if you can help us we would be glad.

@madguy02

Can you assign it to me, @VinciGit00?

@VinciGit00
Collaborator

Hi @madguy02, I assigned it.

@madguy02

madguy02 commented Nov 19, 2024

So, can you tell me more about the model, @Etherealspringfall? How many tokens does it support? For GPT-3.5 or GPT-4, 50000 is well within limits, and the error says the input range is [1, 129024]. I tried the code out with GPT-4 mini and it does not give me the error above, so I think it comes down to the token limits of the model you're using.

Moreover, AFAIK XMLScraperGraph is meant to scrape .xml files; I'm not sure .html input is supported in this case(?)

@Etherealspringfall can you run this code and give me the number of tokens in the text you are sending:

import tiktoken

MODEL = "gpt-4"
MAX_TOKENS = 50000

# Note: tiktoken.get_encoding() expects an encoding name like
# "cl100k_base"; encoding_for_model() resolves the right encoding
# for a model name such as "gpt-4".
encoding = tiktoken.encoding_for_model(MODEL)

def count_tokens(text: str) -> int:
    tokens = encoding.encode(text)
    return len(tokens)

# e.g. print(count_tokens(text)) with your HTML, then compare to MAX_TOKENS

@VinciGit00
Collaborator

That graph is deprecated.
