use youtube chapter as hints and metadata in the youtube loader #7366

Closed
thiswillbeyourgithub opened this issue Jul 7, 2023 · 14 comments
Labels
🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features

Comments

@thiswillbeyourgithub
Contributor

Feature request

When using the YouTube loader, I think it would be useful to take chapters into account when they are present.

  1. The chapter timecodes could be used to decide where to chunk. Every chunk that falls inside a chapter's timeframe could also carry the same "youtube_chapter_title" metadata.
  2. The name of the chapter could be added directly inside the transcript, for example as a markdown header. This could help an LLM maintain context over time.
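
To make the two points above concrete, here is a minimal sketch, not an existing LangChain API. It assumes transcript segments are dicts with "text" and "start" (seconds) keys, and that chapters are already available as (start_seconds, title) pairs; how to obtain the chapters is out of scope here. `Chunk` and `chunk_by_chapter` are hypothetical names used only for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a loader document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_chapter(segments, chapters, add_headers=True):
    """Group transcript segments into one chunk per chapter.

    segments: list of dicts with "text" and "start" (seconds) keys (assumed shape).
    chapters: list of (start_seconds, title) tuples (assumed shape).
    """
    chapters = sorted(chapters)
    starts = [start for start, _ in chapters]

    # Map each segment to the last chapter that starts at or before it.
    grouped = {}
    for seg in segments:
        idx = bisect_right(starts, seg["start"]) - 1
        grouped.setdefault(idx, []).append(seg["text"])

    chunks = []
    for idx in sorted(grouped):
        # idx == -1 means the segment precedes the first chapter.
        title = chapters[idx][1] if idx >= 0 else None
        body = " ".join(grouped[idx])
        if add_headers and title:
            # Point 2: put the chapter name in the text as a markdown header.
            body = f"## {title}\n\n{body}"
        # Point 1: every chunk in a chapter carries the same title metadata.
        metadata = {"youtube_chapter_title": title}
        if idx >= 0:
            metadata["chapter_start_seconds"] = starts[idx]
        chunks.append(Chunk(text=body, metadata=metadata))
    return chunks
```

A chapter longer than the desired chunk size could then be split further by any text splitter, with each sub-chunk copying the chapter's metadata.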

Motivation

There is useful information in the YouTube chapter titles and timecodes that could be of use to LLMs.

Summaries of transcripts would probably be of higher quality if headers were present rather than a huge wall of text.

Adding metadata is always a win.

Your contribution

Unfortunately not able to help for the time being but wanted to get the idea out there.

@dosubot dosubot bot added the 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features label Jul 7, 2023
@thiswillbeyourgithub thiswillbeyourgithub changed the title from "add chapter titles as hints and metadata in the youtube loader" to "use youtube chapter as hints and metadata in the youtube loader" Jul 7, 2023
@AmanSal1

AmanSal1 commented Jul 9, 2023

@thiswillbeyourgithub Can I give it a shot?

@thiswillbeyourgithub
Contributor Author

I'm hardly in a position to lead anything but sure, absolutely. Thanks a lot. I can happily give an opinion and light review of the code though. Thanks again!

@AmanSal1

AmanSal1 commented Jul 9, 2023

@thiswillbeyourgithub Oh okay! I have only recently started contributing to open source, and I really want to contribute to LangChain. Do you happen to know how and where issues are assigned? I'm not very familiar with the repository yet.

@thiswillbeyourgithub
Contributor Author

I think you just have to familiarize yourself with the contributing guidelines and make a PR :)

@AmanSal1

@thiswillbeyourgithub So if we find an issue that nobody is working on, we can submit a PR directly without being assigned, right?

@thiswillbeyourgithub
Contributor Author

I think so yeah.

@dosubot

dosubot bot commented Oct 9, 2023

Hi, @thiswillbeyourgithub! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you requested a feature to use YouTube chapters as hints and metadata in the YouTube loader. This would involve using chapter timecodes and titles to improve the quality of summarized transcripts by adding headers and maintaining context over time. You mentioned that you are unable to contribute to the implementation at the moment but wanted to share the idea.

I noticed that AmanSal1 has expressed interest in working on this feature and asked for guidance on how to contribute. You responded by suggesting that they familiarize themselves with the contributing guidelines and make a pull request. AmanSal1 also asked if they can submit a PR for an unassigned issue, and you confirmed that it is possible.

If this issue is still relevant to the latest version of the LangChain repository, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

Best regards,
Dosu

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 9, 2023
@thiswillbeyourgithub
Contributor Author

Yes this is still relevant

@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 9, 2023
@dosubot

dosubot bot commented Oct 9, 2023

@baskaryan Could you please help @thiswillbeyourgithub with this issue? They have indicated that it is still relevant. Thank you!


dosubot bot commented Feb 4, 2024

Hi, @thiswillbeyourgithub,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue requests the use of YouTube chapter information in the YouTube loader to improve the quality of summarized transcripts. You had mentioned that you are unable to contribute at the moment but wanted to share the idea. A user named AmanSal1 has expressed interest in working on this feature and asked for guidance on how to contribute.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 4, 2024
@thiswillbeyourgithub
Contributor Author

I still think incorporating chapters as metadata is a valuable feature. Even better, if someone manages it, chapter transitions could be included directly in the text using timestamps.

@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 5, 2024
@iamuv2000

@thiswillbeyourgithub I'd love to give this a shot. I modified this to extract the description; I think a bit of regex should allow me to extract the timestamps from the description, if available.
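
A rough sketch of the regex idea mentioned above, assuming the video description is already available as a plain string and that chapter lines start with a timestamp such as `0:00`, `12:34`, or `1:02:03` followed by a title. `CHAPTER_LINE` and `parse_chapters` are hypothetical names, and the pattern is a guess at common description layouts rather than a specification; it performs no validation of ordering or minimum chapter length.

```python
import re

# Optional "(", optional hours, minutes, two-digit seconds, optional ")",
# an optional separator (-, en dash, or :), then a non-empty title.
CHAPTER_LINE = re.compile(
    r"^\(?(?:(\d{1,2}):)?(\d{1,2}):(\d{2})(?!\d)\)?\s*[-–:]?\s*(\S.*)$"
)


def parse_chapters(description):
    """Return a list of (start_seconds, title) tuples found in a description."""
    chapters = []
    for line in description.splitlines():
        match = CHAPTER_LINE.match(line.strip())
        if not match:
            continue  # Not a line that begins with a timestamp and a title.
        hours, minutes, seconds, title = match.groups()
        start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        chapters.append((start, title.strip()))
    return chapters
```

The resulting pairs are sorted by nothing but their order in the description, so a caller may want to sort them and discard the result when fewer than a handful of timestamps are found.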

@jonespm

jonespm commented Jun 15, 2024

It looks like someone already opened a PR for this feature in youtube-transcript-api (which I believe this loader uses). I'm not sure how active that project's maintainer is. jdepoix/youtube-transcript-api#254

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 14, 2024
@dosubot dosubot bot closed this as not planned Sep 21, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 21, 2024
@thiswillbeyourgithub
Contributor Author

If anyone is still interested in YouTube chapter-aware subtitles, I implemented this as part of wdoc, my RAG app.

Here's the link to the relevant function: https://github.com/thiswillbeyourgithub/wdoc/blob/af5297171ac744677152cf01296e6b24171b7035/wdoc/utils/loaders.py#L2221


4 participants