Fix Readable.js on certain pages (like HN) #8
Comments
Looking into this, it's HN's CSP (Content Security Policy) that blocks loading scripts from Skypack; the browser console reports the blocked script load.
So setting the page up with the bypassCSP option gets past it (https://playwright.dev/docs/api/class-browser#browser-new-page-option-bypass-csp):
// Open new page
const page = await browser.newPage({ bypassCSP: true })
await page.goto('https://news.ycombinator.com')

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'text',
})

// Show the result from LLM
console.log(data.top)

Example output I'd get right now:
[
  {
    title: "Crowdstrike Update: Windows Bluescreen and Boot Loops",
    points: 2126,
    by: "BLKNSLVR",
    commentsURL: "https://reddit.com",
  }, {
    title: "FCC votes unanimously to dramatically limit prison telecom charges",
    points: 293,
    by: "Avshalom",
    commentsURL: "https://worthrises.org",
  }, {
    title: "Foliate: Read e-books in style, navigate with ease",
    points: 330,
    by: "ingve",
    commentsURL: "https://johnfactotum.github.io",
  }, {
    title: "Want to spot a deepfake? Look for the stars in their eyes",
    points: 65,
    by: "jonbaer",
    commentsURL: "https://ras.ac.uk",
  }, {
    title: "Startups building balloons to hoist tourists 100k feet into the stratosphere",
    points: 15,
    by: "amichail",
    commentsURL: "https://cnbc.com",
  }
]
So I suppose it'd be sufficient to document this behavior as a known constraint on some sites when using text mode with Readable.js?
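If this does get documented as a known constraint, a small check could help users tell whether a given site is affected. The sketch below is only an illustration: `cspBlocksScriptsFrom` is a hypothetical helper, not part of this project or of Playwright, and it assumes the script origin is something like `https://cdn.skypack.dev`. It reads a `Content-Security-Policy` header value and reports whether scripts from that origin would be refused. It deliberately ignores nonces, hashes, host wildcards and scheme-only sources, and it does not see policies delivered through a `<meta>` tag.

```javascript
// Hypothetical helper (not part of this project or Playwright).
// Returns true when the given CSP header value would block scripts
// loaded from `origin`, e.g. 'https://cdn.skypack.dev' (assumed origin).
// Simplified: only '*', exact origins and URL-prefix sources are matched.
function cspBlocksScriptsFrom(cspHeader, origin) {
  if (!cspHeader) return false // no policy, so nothing is blocked

  // Parse "name source source; name source ..." into a map.
  // When a directive is repeated, the first occurrence wins.
  const directives = new Map()
  for (const part of cspHeader.split(';')) {
    const [name, ...sources] = part.trim().split(/\s+/)
    const key = (name || '').toLowerCase()
    if (key && !directives.has(key)) directives.set(key, sources)
  }

  // script-src-elem falls back to script-src, which falls back to default-src.
  const sources =
    directives.get('script-src-elem') ??
    directives.get('script-src') ??
    directives.get('default-src')
  if (sources === undefined) return false // no directive governs scripts

  const allowed = sources.some(
    (s) => s === '*' || s === origin || s.startsWith(origin + '/')
  )
  return !allowed
}
```

With Playwright, the header value could be read from the navigation response, for example `const response = await page.goto(url)` followed by `response?.headers()['content-security-policy']`, and a warning printed suggesting `bypassCSP: true` when the helper returns true.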
We should get rid of Readable.js in favour of html2text.
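The reason this sidesteps the CSP problem is that the conversion can run in the Node process on the page's HTML, so nothing has to be injected into the page. The sketch below is a naive, dependency-free stand-in written only to show that shape; `htmlToText` is a hypothetical function, not the html2text library mentioned above, and a real converter handles many more cases (links, tables, the full entity set, inline tags inside words).

```javascript
// Naive HTML-to-text conversion that runs in Node, not inside the page.
// Illustrative stand-in only; not the html2text library.
function htmlToText(html) {
  return html
    // drop elements whose content is never visible text
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // drop HTML comments
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // turn the end of common block elements and <br> into newlines
    .replace(/<\/(p|div|tr|li|h[1-6])\s*>|<br\s*\/?>/gi, '\n')
    // strip every remaining tag
    .replace(/<[^>]+>/g, ' ')
    // decode a handful of common entities; &amp; goes last to avoid double-decoding
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&')
    // normalise whitespace
    .replace(/[ \t]+/g, ' ')
    .replace(/ ?\n ?/g, '\n')
    .replace(/\n{2,}/g, '\n')
    .trim()
}
```

In a scraper this might be called as `htmlToText(await page.content())`, since Playwright's `page.content()` returns the serialized HTML and no third-party script has to load under the site's CSP.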