Can Kor be used to identify sections in a document? #171

IP1102 · 2023-06-08T04:38:21Z

To give a brief overview: say I want to parse CVs from job applications. I don't know the structure of the data in advance, since everyone writes their CV in their own style, and I want to identify the sections that belong to specific topics such as Skills, Experience, and Education. Can Kor work with this kind of unstructured data?

eyurtsev · 2023-06-09T20:37:47Z

Kor can't do that right now.

There might be a way to hack together a solution by prefixing each line with its line number and asking Kor to identify the start_line, end_line, and section name of each section.

But I don't know what kind of quality to expect from this, and there are other approaches one should try to get extraction results of good enough quality.

Adding this functionality is not out of the question, but it would require some effort, so we'd want to see interest in it from the community.
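
A minimal sketch of the pre- and post-processing that the line-number idea above would need, assuming the model returns (name, start_line, end_line) triples with 1-based, inclusive line numbers. The helper names and the `SectionSpan` type are hypothetical, nothing here calls Kor, and the spans are hard-coded stand-ins for model output.

```python
from dataclasses import dataclass


def number_lines(text: str) -> str:
    """Prefix every line with a 1-based line number so a model can refer to it."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


@dataclass
class SectionSpan:
    name: str
    start_line: int  # 1-based, inclusive (assumed convention)
    end_line: int    # 1-based, inclusive (assumed convention)


def slice_sections(text: str, spans: list[SectionSpan]) -> dict[str, str]:
    """Map each section name to the original, unnumbered text it covers."""
    lines = text.splitlines()
    sections = {}
    for span in spans:
        # Reject spans the model may have hallucinated outside the document.
        if not (1 <= span.start_line <= span.end_line <= len(lines)):
            raise ValueError(f"span out of range: {span}")
        sections[span.name] = "\n".join(lines[span.start_line - 1 : span.end_line])
    return sections


cv = "Jane Doe\nSkills\nPython, SQL\nEducation\nBSc Physics"
numbered = number_lines(cv)  # this is what would be sent to the model

# Hypothetical model output, for illustration only.
spans = [SectionSpan("Skills", 2, 3), SectionSpan("Education", 4, 5)]
sections = slice_sections(cv, spans)
```

Slicing the original text rather than the numbered copy keeps the line-number prefixes out of the extracted sections.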

IP1102 · 2023-06-12T00:29:16Z

Thanks for the reply, @eyurtsev. Also, when you say there are other approaches for extraction, can you suggest some examples? That would help me figure out a solution, and I could contribute by adding this feature to Kor.

eyurtsev · 2023-06-12T16:02:31Z

If you're trying to use an LLM approach, you could try:

Ask the LLM to repeat the original text verbatim, but with XML tags added around each section of interest.
Use the edit API from OpenAI, and ask it to add XML tags around each section of interest.
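
Whichever way the tags get added, the tagged text still has to be parsed and checked. A rough sketch, assuming the model was asked to use simple, non-nested tags such as `<skills>` and `<education>`; the tag names and helper functions are made up for illustration, and the tagged string is a hand-written stand-in for model output.

```python
import re


def extract_tagged_sections(tagged: str, tags: list[str]) -> dict[str, list[str]]:
    """Collect the text found inside each <tag>...</tag> pair (non-nested)."""
    found = {}
    for tag in tags:
        name = re.escape(tag)
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
        found[tag] = [match.strip() for match in pattern.findall(tagged)]
    return found


def strip_tags(tagged: str, tags: list[str]) -> str:
    """Remove only the known opening and closing tags, leaving the text."""
    alternation = "|".join(re.escape(tag) for tag in tags)
    return re.sub(rf"</?(?:{alternation})>", "", tagged)


def is_verbatim(original: str, tagged: str, tags: list[str]) -> bool:
    """True if the tagged text matches the original once tags are removed.

    Whitespace differences are ignored, since a model may reflow lines.
    """
    def normalize(text: str) -> str:
        return " ".join(text.split())

    return normalize(strip_tags(tagged, tags)) == normalize(original)


original = "Jane Doe\nSkills: Python, SQL\nEducation: BSc Physics"
# Hand-written stand-in for what a model might return.
tagged = (
    "Jane Doe\n"
    "<skills>Skills: Python, SQL</skills>\n"
    "<education>Education: BSc Physics</education>"
)
tags = ["skills", "education"]

sections = extract_tagged_sections(tagged, tags)
faithful = is_verbatim(original, tagged, tags)
```

The verbatim check is there because a model asked to repeat text may silently drop or paraphrase parts of it, and a failed check is a signal to retry or fall back to another approach.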

Alternatively, you could generate word-, sentence-, or paragraph-level features and then classify on top of them with logistic regression. The features can be generated using LLMs or other NLP approaches.
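
To make the feature-plus-classifier idea concrete, here is a toy sketch that marks which lines look like section headers. Everything in it is an assumption for illustration: the hand-written line features, the keyword list, the tiny training set, and the hand-rolled logistic regression (in practice one would more likely use a library implementation and richer features such as embeddings).

```python
import math

SECTION_WORDS = {"skills", "experience", "education"}  # assumed keyword list


def featurize(line: str) -> list[float]:
    """Turn one line into a small, hand-made feature vector."""
    words = line.lower().strip(" :").split()
    return [
        1.0,                                                     # bias term
        1.0 if len(words) <= 3 else 0.0,                         # short line
        1.0 if any(w in SECTION_WORDS for w in words) else 0.0,  # section keyword
        1.0 if line.rstrip().endswith(":") or line.isupper() else 0.0,  # header styling
    ]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, lr=0.5, epochs=500):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
    return weights


def header_probability(weights, line: str) -> float:
    """Estimated probability that the line is a section header."""
    return sigmoid(sum(w * v for w, v in zip(weights, featurize(line))))


# Tiny made-up training set: 1 = section header, 0 = body text.
examples = [
    ("SKILLS", 1),
    ("Python, SQL and Docker for data pipelines", 0),
    ("Education:", 1),
    ("BSc Physics, 2015 to 2018, University of Leeds", 0),
    ("Work Experience", 1),
    ("Built reporting tools for the finance team", 0),
    ("Jane Doe", 0),
]
weights = train([featurize(text) for text, _ in examples],
                [label for _, label in examples])
```

Lines scoring above 0.5 with `header_probability` would be treated as section boundaries, and the text between consecutive boundaries would form a section.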

One of the issues you'll probably bump into with PDFs is layout analysis, i.e., figuring out how best to map the content of the PDF into text. This step may be critical to getting good quality, but it really depends on your problem.

eyurtsev added the enhancement (New feature or request) label on Jun 9, 2023
