Add local file path & Raw HTML string support. #16

Closed
wants to merge 5 commits into from

Conversation

Contributor

@timothycarambat timothycarambat commented Apr 25, 2024

Added support for loading of local HTML files for parsing and LLM generation.

Requires a new dependency, node-html-parser, to parse the HTML locally.

resolves #15
resolves #10

@luminous8

luminous8 commented Apr 26, 2024

🙏 man you're fast!

Quick question: is it possible with your update to pass raw HTML that is not stored on the filesystem?
Something like:

const response = await fetch('https://news.ycombinator.com/');
// The front page is HTML, so read the body as text rather than parsing it as JSON
const rawHtml = await response.text();

// Run the scraper
const pages = await scraper.runFiles([rawHtml], {
  model: 'gpt-4-turbo',
  schema,
  mode: 'html',
  closeOnFinish: true,
})

instead of

// Grab today's HN front page to run the example
await fetch('https://news.ycombinator.com/')
  .then((res) => res.text())
  .then((html) => writeFileSync(HNRawHtmlPath, html, { encoding: 'utf-8', flag: 'w' }))
  .catch((e) => {
    console.error("Failed to fetch content from Hackernews", e)
  })

// Local file paths to scrape - will be loaded from local filepaths.
const filePaths = [HNRawHtmlPath]

// Run the scraper
const pages = await scraper.runFiles(filePaths, {
  model: 'gpt-4-turbo',
  schema,
  mode: 'html',
  closeOnFinish: true,
})

Thanks

@timothycarambat
Contributor Author

@luminous8 - added. IDK if the maintainer will merge this, but commit ed45843 accomplishes that.
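
The thread does not show how ed45843 implements this, so the following is only a minimal sketch of one way a single entry point could accept both local file paths and raw HTML strings. The helper names (`looksLikeRawHtml`, `resolveHtml`, `resolveAll`) and the leading-`<` heuristic are assumptions for illustration and are not part of the library's API.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical heuristic: treat any input whose first non-whitespace
// character is "<" as raw HTML; everything else is assumed to be a file path.
function looksLikeRawHtml(input) {
  return input.trimStart().startsWith('<');
}

// Hypothetical helper: return the HTML for one input, reading from disk
// only when the input does not look like markup.
function resolveHtml(input) {
  return looksLikeRawHtml(input)
    ? input
    : fs.readFileSync(input, { encoding: 'utf-8' });
}

// Resolve a mixed list of file paths and raw HTML strings.
function resolveAll(inputs) {
  return inputs.map(resolveHtml);
}

// Usage: one document on disk, one held in memory.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'html-demo-'));
const filePath = path.join(tmpDir, 'page.html');
fs.writeFileSync(filePath, '<html><body><h1>From disk</h1></body></html>', {
  encoding: 'utf-8',
});

const docs = resolveAll([filePath, '<p>From memory</p>']);
console.log(docs.length);
```

A heuristic like this keeps the call site identical for both cases, at the cost of misclassifying unusual inputs (for example, an HTML fragment that begins with plain text); an explicit option or separate method would avoid that ambiguity.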

@luminous8

Awesome! 🤞

@timothycarambat timothycarambat changed the title Add local file paths instead of URLS Add local file path & Raw HTML string support. Apr 26, 2024
@mishushakov
Owner

Hey, thank you so much for your contribution 🙏
Unfortunately, I cannot merge this at the moment, as I'm planning to make the library independent of Playwright.

For bigger contributions like this, we should discuss the implementation first (in an issue) and then proceed to a pull request.
Still appreciate and very thankful for all the effort you put into making this 😄

@timothycarambat timothycarambat deleted the local-file-parse branch May 1, 2024 16:33
Development

Successfully merging this pull request may close these issues.

Use LLM without Playwright
Allow passing playwright pages instead of URLs