Add functionality for PubMed Central text retrieval #156
This currently works reliably only with the 16k model, even with text inputs that really should fit into the context. Is something strange going on with tokenization?
Also have an issue with the result type:
Try this for a quick demo:
One factor inflating the token count of PMC text input is tables. Example:
There isn't really an obvious reason to keep all the newlines, and we aren't parsing tables on their own, so I'll replace the newlines with spaces.
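For illustration, the newline replacement could look roughly like the sketch below. The function name `collapse_whitespace` and the sample input are hypothetical and not taken from this PR, and the sketch goes slightly further than a plain newline-to-space swap by collapsing every run of whitespace into a single space. How much this reduces the token count depends on the tokenizer in use.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (newlines, tabs, spaces) with one space."""
    # Hypothetical helper for illustration; the PR's actual code may differ.
    return re.sub(r"\s+", " ", text).strip()


# Made-up table-like text of the kind that appears once table markup is stripped.
table_like = "Gene\n\nExpression\n\nBRCA1\n\n2.4\n"
print(collapse_whitespace(table_like))  # prints "Gene Expression BRCA1 2.4"
```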
Waiting to see if this works with what @AgranyaGitHub is doing in #149
Run as, for example:
ontogpt pubmed-extract -t core.TextWithTriples --get-pmc 25833107
By default, this will break each PMC entry's body text up into multiple chunks to fit into the available context size.
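As a rough picture of what chunking to a context budget means, here is a minimal sketch. It is not the implementation in this PR: the function name `chunk_text` and the `max_words` parameter are hypothetical, and it counts whitespace-separated words as a stand-in for model tokens, which a real implementation would measure with the model's tokenizer.

```python
def chunk_text(text: str, max_words: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most max_words words each."""
    # Assumption: word count approximates token count; real token counts
    # are usually higher, so max_words should leave some headroom.
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


body = "one two three four five"
print(chunk_text(body, max_words=2))  # prints ['one two', 'three four', 'five']
```

Each chunk would then be sent to the model separately, so the budget also needs room for the prompt template and the response.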