Integrated working PyTorch-CRF in MM #413
Conversation
@vrdn-23 - Have you profiled the memory usage of this (especially for the feature extraction)? It may not matter to the build anymore, but it would be good to know.
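One lightweight way to answer this kind of profiling question is the standard-library `tracemalloc` module. This is only a sketch: `extract_features` is a hypothetical stand-in for whatever featurizer the CRF trainer actually calls, not a MindMeld API.

```python
import tracemalloc

def extract_features(tokens):
    # Toy stand-in: build a per-token feature dict, as a CRF featurizer might.
    return [{"word": t, "lower": t.lower(), "len": len(t)} for t in tokens]

def peak_memory_kib(fn, *args):
    """Run fn(*args) and return its peak traced allocation in KiB."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024

kib = peak_memory_kib(extract_features, ["Book", "a", "flight"] * 1000)
print(f"peak: {kib:.1f} KiB")
```

Note that `tracemalloc` only counts Python-level allocations; memory held by C extensions (e.g. tensors) would need an OS-level tool such as RSS sampling.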
Folks, I have enough evidence to suggest that accuracy has held up since I first ran the experiments, and I am convinced this PR is ready to graduate from its "Draft" status. It would be great if everyone could let me know what looks good and what needs a complete overhaul. I will be giving the PR my full attention starting this Monday (05/16) and hope to have the time/memory/accuracy analysis done and posted on the PR by then. I aim to have all outstanding comments resolved by Monday EOD, and I look forward to addressing new ones!
I don't have a better way of formatting this as a markdown table, so I'm going to leave it as a picture, but these are the main conditions worth knowing about this data:
The current configuration that I would support for use in ANLP would be the "hash" configuration with patience 5. The rest of the experimental settings are included to showcase an overall picture of how the torch-crf performs.
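To make the "hash with patience 5" setting concrete, here is an illustrative sketch. The key names below are made up for this example and are not MindMeld's actual configuration schema; the `should_stop` helper just shows what a patience of 5 means for early stopping.

```python
# Hypothetical settings dict -- key names are assumptions, not MindMeld's schema.
TORCH_CRF_SETTINGS = {
    "feat_type": "hash",  # hash features into a fixed-size space
    "patience": 5,        # stop after 5 epochs with no dev-set improvement
}

def should_stop(dev_scores, patience):
    """Early stopping: stop once the best dev score is `patience` epochs old."""
    if not dev_scores:
        return False
    best_epoch = dev_scores.index(max(dev_scores))
    return (len(dev_scores) - 1 - best_epoch) >= patience
```

For example, with `patience=5`, training stops only after five consecutive epochs fail to beat the best dev-set score seen so far.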
I realize that this is a lot to take in, but if you have any questions or comments, please feel free to reach out to me 1-on-1 or drop a comment here! I'm hoping that with enough review comments, we can make this code cleaner, faster, lighter, or all of the above!
@vrdn-23 - I don't think the memory usage matters too much anymore; @tmehlinger can weigh in on that. When we were building for SPLAT, we had to use the env var documented here: https://www.mindmeld.com/docs/userguide/getting_started.html#mm-crf-features-in-memory If you rerun the tests with it set to '0', the original CRF training will use significantly less memory. I'm fine with removing that option, but we need to verify our build pipelines can handle the memory usage, and we should remove the env var from the documentation.
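For anyone unfamiliar with this kind of toggle, here is a sketch of how such an env-var check typically works. Judging from the doc anchor above, the variable is presumably named `MM_CRF_FEATURES_IN_MEMORY`, but treat that name and this logic as assumptions, not MindMeld's actual code.

```python
import os

def features_in_memory(environ=os.environ):
    """Keep CRF features in memory unless the env var is explicitly '0'.

    Hypothetical helper; the variable name is inferred from the doc anchor.
    """
    return environ.get("MM_CRF_FEATURES_IN_MEMORY", "1") != "0"
```

Setting the variable to `'0'` would then route feature storage to disk, trading training speed for a much smaller memory footprint.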
The current test failures are related to installing a package called SudachiPy for the Japanese spaCy tokenizer. I'm looking into it, and also into a potential bug in incremental loading.
Created #424 to address the incremental build failure.
Folks,
This is the first draft PR I have ready for moving from the old, rusty sklearn-crfsuite to the new (hopefully better) pytorch-crf. A couple of things to note here:
If anyone feels that we should still provide GPU support, please let me know and I'd be happy to add it back.
Please feel free to comment or let me know if you have any questions about the PyTorch code, or if I've missed anything essential in integrating this new monstrosity!
Thanks for all the help in advance! Looking forward to meeting everyone at a newer scikit-learn version soon! :)
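For reviewers less familiar with CRF layers: the core of what a CRF's inference step does is Viterbi search over tag sequences. Below is a tiny pure-Python sketch of that algorithm, illustrating the computation that a CRF layer's decode performs over tensors; it is not the pytorch-crf implementation itself.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   [seq_len][num_tags] per-token tag scores
    transitions: [num_tags][num_tags] score for moving tag i -> tag j
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])  # best score ending in each tag so far
    history = []                # backpointers per step
    for emit in emissions[1:]:
        next_score, back = [], []
        for j in range(num_tags):
            # Best previous tag to transition into tag j.
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            next_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            back.append(best_i)
        score, history = next_score, history + [back]
    # Trace back from the best final tag.
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for back in reversed(history):
        best = back[best]
        path.append(best)
    return list(reversed(path))
```

With uniform transitions this reduces to picking each token's best emission, e.g. `viterbi_decode([[1, 0], [0, 1]], [[0, 0], [0, 0]])` yields `[0, 1]`; non-uniform transitions are what let the CRF forbid invalid tag bigrams.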