Integrated working PyTorch-CRF in MM #413
Conversation
@vrdn-23 - Have you profiled the memory usage of this (especially for the feature extraction)? It may not matter to the build anymore, but it would be good to know.
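One lightweight way to answer this kind of profiling question is the standard-library `tracemalloc` module. This is only a sketch: `extract_features` is a hypothetical stand-in for whatever featurizer the CRF trainer actually calls, not a MindMeld API.

```python
import tracemalloc

def extract_features(tokens):
    # Toy stand-in: build a per-token feature dict, as a CRF featurizer might.
    return [{"word": t, "lower": t.lower(), "len": len(t)} for t in tokens]

def peak_memory_kib(fn, *args):
    """Run fn(*args) and return its peak traced allocation in KiB."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024

kib = peak_memory_kib(extract_features, ["Book", "a", "flight"] * 1000)
print(f"peak: {kib:.1f} KiB")
```

Note that `tracemalloc` only counts Python-level allocations; memory held by C extensions (e.g. tensors) would need an OS-level tool such as RSS sampling.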
Folks, I have enough evidence to suggest that accuracy has held up since I first ran the experiments, and I am convinced this PR is ready to graduate from its "Draft" status. It would be great if everyone could let me know what looks good and what needs a complete overhaul. I will be giving the PR my full attention starting this Monday (05/16) and hope to have the time/memory/accuracy analysis done and posted on the PR by then. I aim to have all outstanding comments resolved by Monday EOD, and I look forward to addressing new ones!
I don't have a better way of formatting this as a markdown table, so I'm going to leave it as a picture, but these are the main conditions worth knowing about this data:
The current configuration that I would support for use in ANLP would be the "hash" configuration with patience 5. The rest of the experimental settings are included to showcase an overall picture of how the torch-crf performs.
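To make the "hash with patience 5" setting concrete, here is an illustrative sketch. The key names below are made up for this example and are not MindMeld's actual configuration schema; the `should_stop` helper just shows what a patience of 5 means for early stopping.

```python
# Hypothetical settings dict -- key names are assumptions, not MindMeld's schema.
TORCH_CRF_SETTINGS = {
    "feat_type": "hash",  # hash features into a fixed-size space
    "patience": 5,        # stop after 5 epochs with no dev-set improvement
}

def should_stop(dev_scores, patience):
    """Early stopping: stop once the best dev score is `patience` epochs old."""
    if not dev_scores:
        return False
    best_epoch = dev_scores.index(max(dev_scores))
    return (len(dev_scores) - 1 - best_epoch) >= patience
```

For example, with `patience=5`, training stops only after five consecutive epochs fail to beat the best dev-set score seen so far.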
I realize that this is a lot to take in, but if you have any questions or comments, please feel free to reach out to me 1-on-1 or drop a comment here! I'm hoping that with enough review comments, we can make this code cleaner, faster, lighter, or all of the above!
@vrdn-23 - I don't think the memory usage matters too much anymore; @tmehlinger can weigh in on that. When we were building for SPLAT, we had to use the env var documented here: https://www.mindmeld.com/docs/userguide/getting_started.html#mm-crf-features-in-memory If you rerun the tests with it set to '0', the original CRF training will use significantly less memory. I'm fine with removing that option, but we need to verify our build pipelines can handle the memory usage, and we should remove the env var from the documentation.
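For anyone unfamiliar with this kind of toggle, here is a sketch of how such an env-var check typically works. Judging from the doc anchor above, the variable is presumably named `MM_CRF_FEATURES_IN_MEMORY`, but treat that name and this logic as assumptions, not MindMeld's actual code.

```python
import os

def features_in_memory(environ=os.environ):
    """Keep CRF features in memory unless the env var is explicitly '0'.

    Hypothetical helper; the variable name is inferred from the doc anchor.
    """
    return environ.get("MM_CRF_FEATURES_IN_MEMORY", "1") != "0"
```

Setting the variable to `'0'` would then route feature storage to disk, trading training speed for a much smaller memory footprint.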
The current test failures are related to installing a package called SudachiPy for the Japanese spaCy tokenizer. I'm looking into it, and also into a potential bug in incremental loading.
Created #424 to address the incremental build failure.
Folks,
This is the first draft PR I have ready for moving from the old, rusty sklearn-crfsuite to the new (hopefully better) pytorch-crf. A couple of things to note here:
If anyone feels that we should still provide GPU support, please let me know and I'd be happy to add it back.
Please feel free to comment or let me know if you have any questions about the PyTorch code, or if I've missed anything essential in integrating this new monstrosity!
Thanks for all the help in advance! Looking forward to meeting everyone at a newer scikit-learn version soon! :)
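For reviewers less familiar with CRF layers: the core of what a CRF's inference step does is Viterbi search over tag sequences. Below is a tiny pure-Python sketch of that algorithm, illustrating the computation that a CRF layer's decode performs over tensors; it is not the pytorch-crf implementation itself.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   [seq_len][num_tags] per-token tag scores
    transitions: [num_tags][num_tags] score for moving tag i -> tag j
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])  # best score ending in each tag so far
    history = []                # backpointers per step
    for emit in emissions[1:]:
        next_score, back = [], []
        for j in range(num_tags):
            # Best previous tag to transition into tag j.
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            next_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            back.append(best_i)
        score, history = next_score, history + [back]
    # Trace back from the best final tag.
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for back in reversed(history):
        best = back[best]
        path.append(best)
    return list(reversed(path))
```

With uniform transitions this reduces to picking each token's best emission, e.g. `viterbi_decode([[1, 0], [0, 1]], [[0, 0], [0, 0]])` yields `[0, 1]`; non-uniform transitions are what let the CRF forbid invalid tag bigrams.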