Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata fetching fails on Youtube #5208

Open
sallyFoster opened this issue Nov 18, 2024 · 7 comments
Open

Metadata fetching fails on Youtube #5208

sallyFoster opened this issue Nov 18, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@sallyFoster
Copy link

Recently, the YouTube urls are not autofilling post titles anymore. It would be great if you could bring back this feature.

@Nutomic
Copy link
Member

Nutomic commented Nov 18, 2024

You should fill in the issue template.

Anyway I can reproduce the problem. It happens because of #4957 which fetches only a limited amount of data for previews to reduce cpu usage. The problem is that youtube in particular includes a huge amount of javascript in the html file, so the opengraph tags appear much later than 64 bytes. I was able to fetch youtube metadata with bytes_to_fetch = 1000 * 1024 but that makes the optimization almost useless.

cc @phiresky

@Nutomic Nutomic added the bug Something isn't working label Nov 18, 2024
@Nutomic Nutomic changed the title YouTube URLs are not autofilling post titles anymore Metadata fetching fails on Youtube Nov 18, 2024
@Nothing4You
Copy link
Collaborator

I don't think that makes the optimization entirely useless, as it still avoids downloading multiple MB

@dessalines
Copy link
Member

I'd be fine with increasing that, although its frustrating that youtube decides to insert a megabyte of javascript before the usable opengraph tags.

@phiresky
Copy link
Collaborator

phiresky commented Nov 18, 2024

Checking a random youtube video, the og:* tags are at around 500kB. Ridiculous how they insert so much crap before.
A lot of the optimization in #4957 comes from looking at the Content-Type header. the byte limit is more as a fallback because this request is a DOS opportunity and you can of course change the Content-Type header and still send data that makes an HTML parser take ages to parse the result. One megabyte is maybe fine I guess, but it's not ideal. If we change the limit I think we should at least add a log that logs when the CPU use of this operation is > 100ms or so so we can tell when this is happening again (it was not a trivial thing to find)

@phiresky
Copy link
Collaborator

possible patch:

diff --git a/crates/api_common/src/request.rs b/crates/api_common/src/request.rs
index 36010f760..f2c2aeecc 100644
--- a/crates/api_common/src/request.rs
+++ b/crates/api_common/src/request.rs
@@ -32,10 +32,11 @@ use reqwest::{
 };
 use reqwest_middleware::ClientWithMiddleware;
 use serde::{Deserialize, Serialize};
-use tracing::info;
+use tracing::{info, warn};
 use url::Url;
 use urlencoding::encode;
 use webpage::HTML;
+use std::time::Instant;
 
 pub fn client_builder(settings: &Settings) -> ClientBuilder {
   let user_agent = format!("Lemmy/{VERSION}; +{}", settings.get_protocol_and_hostname());
@@ -51,9 +52,8 @@ pub fn client_builder(settings: &Settings) -> ClientBuilder {
 #[tracing::instrument(skip_all)]
 pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
   info!("Fetching site metadata for url: {}", url);
-  // We only fetch the first 64kB of data in order to not waste bandwidth especially for large
-  // binary files
-  let bytes_to_fetch = 64 * 1024;
+  // We fetch the first 1MB of data to ensure we get all metadata
+  let bytes_to_fetch = 1024 * 1024;
   let response = context
     .client()
     .get(url.as_str())
@@ -91,9 +91,24 @@ pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResu
 
       // only take first bytes regardless of how many bytes the server returns
       let html_bytes = collect_bytes_until_limit(response, bytes_to_fetch).await?;
-      extract_opengraph_data(&html_bytes, url)
+      
+      // Track parsing time
+      let parse_start = Instant::now();
+      let result = extract_opengraph_data(&html_bytes, url)
         .map_err(|e| info!("{e}"))
-        .unwrap_or_default()
+        .unwrap_or_default();
+      let parse_duration = parse_start.elapsed();
+      
+      // Log warning if parsing took longer than 100ms
+      if parse_duration.as_millis() > 100 {
+        warn!(
+          "HTML parsing for {} took {}ms, which exceeds the 100ms threshold",
+          url,
+          parse_duration.as_millis()
+        );
+      }
+      
+      result
     }
   };
   Ok(LinkMetadata {

@Nothing4You
Copy link
Collaborator

Nothing4You commented Nov 18, 2024

If I'm skimming over Mastodon source correctly, there is also a 1 MiB limit for fetching html for opengraph metadata:

https://github.com/mastodon/mastodon/blob/295ad6f19a016b3f16e1201ffcbb1b3ad6b455a2/app/services/fetch_link_card_service.rb#L54-L66

https://github.com/mastodon/mastodon/blob/295ad6f19a016b3f16e1201ffcbb1b3ad6b455a2/app/lib/request.rb#L213

Seems like this is a reasonable limit.

@dessalines
Copy link
Member

@phiresky feel free to do a PR w/ that patch when you get a chance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants