Can Kor be used to identify sections in a document? #171
Comments
Kor can't do that right now. One possible hack is to prefix each line with its line number and ask Kor to identify the start_line, end_line, and section name. I don't know what quality to expect from this, though, and there are other approaches that one should try to get extraction results of good enough quality. Adding this functionality is not out of the question, but it would require some effort, so we'd want to see interest from the community.
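For illustration, the line-number hack could be sketched roughly as follows. This is a minimal sketch, not something Kor provides: `number_lines` and `slice_sections` are hypothetical helper names, and the `spans` list stands in for whatever the extraction step would return, assuming it yields `section_name`, `start_line`, and `end_line` fields (1-based, inclusive).

```python
def number_lines(text):
    """Prefix every line with a 1-based line number so a model can refer to it."""
    lines = text.splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def slice_sections(text, spans):
    """Map (section_name, start_line, end_line) records back to the original text."""
    lines = text.splitlines()
    sections = {}
    for span in spans:
        start, end = span["start_line"], span["end_line"]
        # Skip spans that do not fit inside the document (e.g. hallucinated numbers).
        if not (1 <= start <= end <= len(lines)):
            continue
        sections[span["section_name"]] = "\n".join(lines[start - 1:end])
    return sections


cv = "Jane Doe\nSkills\nPython, SQL\nEducation\nBSc Computer Science"

# This numbered text is what would be sent to the extraction step.
numbered = number_lines(cv)

# Hypothetical extraction output, hard-coded here so the sketch runs offline.
spans = [
    {"section_name": "Skills", "start_line": 2, "end_line": 3},
    {"section_name": "Education", "start_line": 4, "end_line": 5},
]
sections = slice_sections(cv, spans)
```

Validating the returned line numbers before slicing matters here, since nothing guarantees that a model returns spans that actually exist in the document.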
Thanks for the reply @eyurtsev. Also, when you say there are other approaches for extraction, can you suggest some examples? That would help me figure out some solutions, and I could contribute by adding this feature to Kor.
If you're trying to use an LLM approach, you could try:
Alternatively, you could generate word-, sentence-, or paragraph-level features and then classify on top of them with logistic regression. The features can be generated using LLMs or other NLP approaches. One of the issues you'll probably run into with PDFs is layout analysis, i.e., figuring out how to map the content of the PDF into text in the best way. This step may be critical for getting good quality, but it really depends on your problem.
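As a rough sketch of the feature-plus-classifier idea above, the snippet below computes a few hand-written line-level features and fits a tiny logistic regression with plain stochastic gradient descent to flag lines that look like section headings. Everything here is an assumption made for illustration: the feature set, the toy training lines, and the helper names are hypothetical, and in practice one would more likely use a library such as scikit-learn and richer features (for example, LLM embeddings).

```python
import math


def line_features(line):
    """Hand-crafted features for one line; the choice of features is illustrative."""
    stripped = line.strip()
    words = stripped.split()
    return [
        1.0,                                      # bias term
        1.0 if len(words) <= 3 else 0.0,          # short line
        1.0 if stripped.endswith(":") else 0.0,   # ends with a colon
        1.0 if stripped.isupper() else 0.0,       # written in all caps
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, features):
    """Predicted probability that the line is a heading."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit the weights with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        for features, label in zip(X, y):
            error = score(weights, features) - label
            weights = [w - lr * error * f for w, f in zip(weights, features)]
    return weights


def predict_is_heading(weights, line):
    return score(weights, line_features(line)) >= 0.5


# Toy training data (made up): 1 = section heading, 0 = body text.
train_lines = [
    ("SKILLS", 1),
    ("Experience:", 1),
    ("EDUCATION", 1),
    ("Projects:", 1),
    ("Built data pipelines in Python and SQL for five years.", 0),
    ("Bachelor of Science in Computer Science, 2015 to 2019.", 0),
    ("Led a team of four engineers on a search product.", 0),
    ("Fluent in English and German, basic Spanish.", 0),
]
X = [line_features(text) for text, _ in train_lines]
y = [label for _, label in train_lines]
weights = train_logistic_regression(X, y)
```

Once headings are flagged, the text between two consecutive headings can be treated as one section. Whether such simple features hold up on real CVs depends heavily on how clean the text coming out of the PDF is.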
To give a brief overview, let's say I want to parse job application CVs. I don't know the structure of the data in advance, since people write their CVs in their own style, and I want to identify the sections belonging to specific topics such as Skills, Experience, Education, etc. Can Kor work with this kind of unstructured data?