Fix Readable.js on certain pages (like HN) #8

Open
mishushakov opened this issue Apr 21, 2024 · 3 comments

Comments

@mishushakov
Owner

No description provided.

@blanky0230

blanky0230 commented Jul 19, 2024

I looked into this: it's HN's Content Security Policy (CSP) that blocks loading scripts from Skypack. The browser console shows:

Refused to load the script 'https://cdn.skypack.dev/@mozilla/readability' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/". Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback.

Creating the page with the option { bypassCSP: true } completely resolves the issue (see https://playwright.dev/docs/api/class-browser#browser-new-page-option-bypass-csp), e.g.:

// Open new page
const page = await browser.newPage({ bypassCSP: true })
await page.goto('https://news.ycombinator.com')

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'text',
})

// Show the result from LLM
console.log(data.top)

Example output I currently get:

[
  {
    title: "Crowdstrike Update: Windows Bluescreen and Boot Loops",
    points: 2126,
    by: "BLKNSLVR",
    commentsURL: "https://reddit.com",
  }, {
    title: "FCC votes unanimously to dramatically limit prison telecom charges",
    points: 293,
    by: "Avshalom",
    commentsURL: "https://worthrises.org",
  }, {
    title: "Foliate: Read e-books in style, navigate with ease",
    points: 330,
    by: "ingve",
    commentsURL: "https://johnfactotum.github.io",
  }, {
    title: "Want to spot a deepfake? Look for the stars in their eyes",
    points: 65,
    by: "jonbaer",
    commentsURL: "https://ras.ac.uk",
  }, {
    title: "Startups building balloons to hoist tourists 100k feet into the stratosphere",
    points: 15,
    by: "amichail",
    commentsURL: "https://cnbc.com",
  }
]
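
To make the console error quoted above more concrete, here is a rough sketch of why the Skypack URL is rejected while a cdnjs URL would pass. The helper name `cspAllowsScript` is hypothetical and not part of this project or of Playwright, and the matching is deliberately simplified: it only does prefix matching on URL sources and ignores keywords such as 'self', nonces, hashes and wildcards, so it is not a full CSP implementation.

```typescript
// Hypothetical helper (an assumption for illustration, not a library API).
// Simplified check of whether a CSP header value permits a script URL.
function cspAllowsScript(policy: string, scriptUrl: string): boolean {
  // Split the policy into directives: name -> list of sources
  const directives = new Map<string, string[]>()
  for (const part of policy.split(';')) {
    const [name, ...sources] = part.trim().split(/\s+/)
    if (name) directives.set(name.toLowerCase(), sources)
  }

  // script-src-elem falls back to script-src, which falls back to default-src
  const sources =
    directives.get('script-src-elem') ??
    directives.get('script-src') ??
    directives.get('default-src')

  // No relevant directive means scripts are not restricted by this policy
  if (sources === undefined) return true

  // Skip quoted keywords ('self', 'unsafe-inline', ...) and prefix-match the rest
  return sources.some(
    (source) => !source.startsWith("'") && scriptUrl.startsWith(source)
  )
}

// The script-src directive from the console error above
const hnPolicy =
  "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/"

console.log(cspAllowsScript(hnPolicy, 'https://cdn.skypack.dev/@mozilla/readability')) // false
console.log(cspAllowsScript(hnPolicy, 'https://cdnjs.cloudflare.com/ajax/libs/example.js')) // true
```

The cdnjs path in the second call is made up for the example. With bypassCSP: true the browser skips this check entirely, which is why the injected script loads.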

@blanky0230

So I suppose it'd be sufficient to document this behavior as a known constraint on some sites, when using text mode with Readable.js?

@mishushakov
Owner Author

We should get rid of Readable.js in favour of html2text
