WIP: huggingface tokenizer and Neural LM training pipeline. #139
base: master
Conversation
This commit is mainly about the Hugging Face tokenizer and a draft transformer/RNN based LM training pipeline.
@@ -0,0 +1,154 @@
import math
We should get in the habit of acknowledging where we got files from, if they were copied from elsewhere...
No problem. I will add a reference to every file. For now, all references are added together in run.sh.
These perplexities, are they per word or per token?
per token.
egs/librispeech/asr/nnlm/run.sh
Outdated
lm_train=data/lm_train/
full_text=$lm_train/librispeech_train_960_text
tokenizer=$lm_train/tokenizer-librispeech_train_960.json
if [ $stage -eq 1 ]; then
Should it be $stage -le 1? And also for the following if statements.
yes. "-le" is better. Now "-eq" is used temporarily because it's easier for me to debug stage by stage.
import os
import shutil
from pathlib import Path
from tokenizers import Tokenizer
Could you add some documentation describing how the environment is set up? I assume that you have run pip install tokenizers beforehand.
No problem. A README.md will be added.
egs/librispeech/asr/nnlm/main.py
Outdated
# Save the model if the validation loss is the best we've seen so far.
if not best_val_loss or val_loss < best_val_loss:
    with open(args.save, 'wb') as f:
        torch.save(model, f)
From https://pytorch.org/tutorials/beginner/saving_loading_models.html
The disadvantage of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model is saved.
Could you save only the state dict of the model?
Solved as follows:

import logging
import os
from pathlib import Path
from typing import Union

import torch

Pathlike = Union[str, Path]
Info = Union[dict, None]

def save_checkpoint(filename: Pathlike,
                    model: torch.nn.Module,
                    info: Info = None) -> None:
    if not os.path.exists(os.path.dirname(filename)):
        Path(os.path.dirname(filename)).mkdir(parents=True, exist_ok=True)
    logging.info(f'Save checkpoint to {filename}')
    checkpoint = {
        'state_dict': model.state_dict(),
    }
    if info is not None:
        checkpoint.update(info)
    torch.save(checkpoint, filename)
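A matching loader that restores only the state dict might look like this (a minimal sketch; this helper is not shown in the PR):

import torch

def load_checkpoint(filename, model: torch.nn.Module) -> dict:
    # Load on CPU first so the checkpoint can be restored on any device.
    checkpoint = torch.load(filename, map_location='cpu')
    model.load_state_dict(checkpoint['state_dict'])
    # Return any extra info (e.g. epoch, loss) stored alongside the weights.
    return {k: v for k, v in checkpoint.items() if k != 'state_dict'}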
egs/librispeech/asr/nnlm/main.py
Outdated
epoch, batch_idx,
len(train_data) // batch_size, lr,
elapsed * 1000 / args.log_interval, cur_loss,
math.exp(cur_loss)))
These perplexities, are they per word or per token?
@danpovey The perplexities are computed as exp(NLL), and the modelling units are tokens, so the PPL is computed with respect to tokens.
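For concreteness, a minimal sketch of the per-token vs. per-word distinction (the counts below are purely illustrative, not measured on LibriSpeech):

import math

total_nll = 50000.0   # sum of negative log-likelihoods over an evaluation set
num_tokens = 12000    # number of subword tokens that were scored
num_words = 9000      # number of words in the same text

ppl_per_token = math.exp(total_nll / num_tokens)
ppl_per_word = math.exp(total_nll / num_words)
# num_words <= num_tokens, so per-word perplexity is >= per-token perplexity
# for the same model and text.
print(ppl_per_token, ppl_per_word)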
@glynpu Do you know what the normal PPL for the LibriSpeech corpus is in terms of tokens?
It would very much depend on the way it was tokenized. It's probably better to divide the total log-prob by the number of words, to get the perplexity per word. I'd guess between about 80 and 200, but that's just a guess.
In our original paper we mention perplexities of 150 and 170.
Referring to the RNN-LM experiment in Kaldi with the LibriSpeech data, I am studying its configuration and hope to get a comparable PPL with the same data this week.
egs/librispeech/asr/nnlm/run.sh
Outdated
num_utts_total=$(wc -l <$full_tokens )
num_valid_test=$(($num_utts_total/${valid_test_fraction}))
set +x
shuf -n $num_valid_test $full_tokens > $valid_test_tokens
Shall we fix the seed for shuf so that the split is reproducible? I think a Python script can do this task equally well and is more maintainable.
Reproducibility is important. Maybe the data separation method of the Kaldi RNNLM can be used in the following experiments:
gunzip -c $text | cut -d ' ' -f2- | awk -v text_dir=$text_dir '{if(NR%2000 == 0) { print >text_dir"/dev.txt"; } else {print;}}' >$text_dir/librispeech.txt
+1 for dropping bash/perl entirely for these sorts of tasks in snowfall.
Yes, probably that modulo method from Kaldi is fine. shuf is not always installed.
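A minimal Python sketch of that modulo-style split (the file names and the modulus of 2000 mirror the Kaldi command quoted above and are only illustrative):

def split_train_dev(in_path: str, train_path: str, dev_path: str,
                    modulo: int = 2000) -> None:
    with open(in_path) as fin, \
         open(train_path, 'w') as ftrain, \
         open(dev_path, 'w') as fdev:
        for line_no, line in enumerate(fin, start=1):
            # Every `modulo`-th line goes to dev, the rest to train,
            # so the split is deterministic and reproducible.
            (fdev if line_no % modulo == 0 else ftrain).write(line)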
egs/librispeech/asr/nnlm/main.py
Outdated
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
That could be overcome by a data sampling and batching strategy where you iterate over the training text with overlapping windows (50% overlap being the obvious setting, but for larger data a smaller value like 20% would probably work just as well and train faster).
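For illustration, a minimal sketch of such overlapping-window chunking (the window size and overlap value are assumptions, not part of this PR):

from typing import List

def overlapping_windows(token_ids: List[int], window: int = 128,
                        overlap: float = 0.2) -> List[List[int]]:
    # Step forward by (1 - overlap) * window tokens each time, so consecutive
    # chunks share roughly `overlap` of their content.
    stride = max(1, int(window * (1.0 - overlap)))
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks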
So is the data treated as one long sequence, rather than a bunch of independent sentences?
I would have thought for ASR applications, the independent-sentences approach might make more sense.
No, the training text is not treated as one long sequence. I have modified the data preparation method so that each piece of text is treated independently. Sorry, I forgot to delete these unrelated original comments.
By the way, I am refactoring the training pipeline according to these reviews. Temporarily, a new dataset class is located here, which handles the training text line by line and then batchifies the pieces independently in CollateFunc.
with open(text_file, 'r') as f:
    # each line represents a piece of text, e.g.
    # DELAWARE IS NOT AFRAID OF DOGS
    for line in f:
        text = line.strip().split()
        assert len(text) > 0
        text_id = self.text2id(text)
        # token_id format:
        # <bos_id> token_id token_id token_id *** <eos_id>
        token_id = self.text_id2token_id(text_id)
        self.data.append(token_id)
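A minimal sketch of what the corresponding collate function could look like (the padding value, batch-first layout, and input/target shift are assumptions, not the exact PR code):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_lm_batch(batch):
    # batch is a list of token-id lists, each already wrapped in <bos>/<eos>.
    seqs = [torch.tensor(ids, dtype=torch.long) for ids in batch]
    # Input drops the last token, target drops the first, so the model
    # predicts the next token at every position.
    inputs = pad_sequence([s[:-1] for s in seqs], batch_first=True, padding_value=0)
    targets = pad_sequence([s[1:] for s in seqs], batch_first=True, padding_value=0)
    return inputs, targets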
args = get_args()
if args.train_file is not None:
    train_files = [args.train_file]
    train_tokenizer(train_files, args.tokenizer_path, args.vocab_size)
Methods like these (train_tokenizer, tokenize_text) would be good candidates to put into the "library" part of snowfall so anybody can import them easily for all the recipes.
Candidate for future work in snowfall: actually this whole script could be easily re-used across recipes had we added a mechanism for auto-registering scripts in PATH (can be done via setup.py)
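A hypothetical sketch of such auto-registration via setup.py entry points (the package layout and command names below are illustrative, not part of snowfall today):

from setuptools import setup, find_packages

setup(
    name='snowfall',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            # exposes a `snowfall-train-tokenizer` command on PATH after `pip install -e .`
            'snowfall-train-tokenizer=snowfall.nnlm.tokenizer:main',
        ],
    },
)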
egs/librispeech/asr/nnlm/run.sh
Outdated
num_utts_total=$(wc -l <$full_tokens )
num_valid_test=$(($num_utts_total/${valid_test_fraction}))
set +x
shuf -n $num_valid_test $full_tokens > $valid_test_tokens
+1 for dropping bash/perl entirely for these sorts of tasks in snowfall.
Good work! I will try to read and understand what you are doing.
add scripts to process word piece lexicons.
def __getitem__(self, idx):
    return self.data[idx]

def text2id(self, text: List[str]) -> List[int]:
The following two methods can be removed.
fixed
nn.init.uniform_(self.decoder.weight, -initrange, initrange)

def forward(self, input, hidden):
    # import pdb; pdb.set_trace()
would be nice to have the dimensions commented here, e.g. is it (batch_size, num_steps)?
fixed
Something is not installed... I don't know how easy it is to set things up so they get installed automatically, or at least so the user is told what to install?
A commit to handle this together with other known bugs will be submitted this afternoon.
Scripts to install tokenizers; fix training bugs; port online tokenization to offline tokenization; load/save checkpoint.
@danpovey Added a statement to automatically install dependencies in run.sh.
Now I am still facing some convergence issues. After several epochs, the PPL is stuck around 1000.
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union

Pathlike = Union[str, Path]
Info = Union[dict, None]
This is equivalent to Info = Optional[dict]
# token_id format:
# <bos_id> token_id token_id token_id *** <eos_id>
if len(token_id) >= 2:
for idx, line in enumerate(f):
idx is never used.
fixed
@@ -37,35 +37,22 @@ def __call__(self, batch: List[List[int]]):

class LMDataset(Dataset):

    def __init__(self, text_file: str, lexicon):
    def __init__(self, text_file: str):
Can you describe the format of text_file?
fixed
@@ -29,17 +30,41 @@ def get_args():


def generate_tokens(args):
    ''' Extract symbols and there corresponding ids from a tokenizer,
typo: the corresponding.
fixed
tokenizer = Tokenizer.from_file(args.tokenizer_path)
symbols = tokenizer.get_vocab()
tokens_file = '{}/tokens.txt'.format(args.lexicon_path)
tokens_f = open(tokens_file, 'w')
for idx, sym in enumerate(symbols):
    tokens_f.write('{} {}\n'.format(sym.lower(), idx))
id2sym = dict((v, k.lower()) for k, v in symbols.items())
id2sym = {idx: sym.lower() for sym, idx in symbols.items()} is much clearer.
fixed
for idx, sym in enumerate(symbols):
    tokens_f.write('{} {}\n'.format(sym.lower(), idx))
id2sym = dict((v, k.lower()) for k, v in symbols.items())
for idx in range(len(symbols)):
Is it required that the resulting file has its second column listed in increasing order? Otherwise, there is no need to create another intermediate variable id2sym; we can iterate over symbols directly.
Just to ensure that the ids are contiguous. And an ordered token list looks nice.
The result is not sorted if we iterate over symbols directly; the output of:
for k, v in symbols.items():
    print(k.lower(), v)
looks like the following (quite disordered):
'''
##ark 335
##umes 3822
vain 3593
eastern 4515
next 1372
knowing 4454
##jo 2789
western 3987
garden 1387
tree 1348
'''
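If an id-ordered tokens.txt is the goal, a sorted iteration would also avoid the intermediate dict (a sketch; the tokenizer path is illustrative and symbols is the token-to-id dict returned by get_vocab()):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file('tokenizer-librispeech_train_960.json')
symbols = tokenizer.get_vocab()   # maps token string -> integer id
with open('tokens.txt', 'w') as tokens_f:
    # Sort by id so the second column is written in increasing order.
    for sym, idx in sorted(symbols.items(), key=lambda kv: kv[1]):
        tokens_f.write('{} {}\n'.format(sym.lower(), idx))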
    output = tokenizer.encode(word)
    tokens = ' '.join(output.tokens)
else:
    tokens = '[unk]'
Is there a difference between [unk] and <UNK>? I find that you're using <UNK> in the above special_words, but [unk] here. BTW: what are special_words for?
The special tokens are a heritage of words.txt (simple_v1/data/lang_nosp/words.txt), whose head is:
<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
A 4
...
#0 200004
<s> 200005
</s> 200006
I just want to make sure that every word in words.txt can be tokenized. As those special words are not "real" words, I think mapping them to [unk] is better than tokenizing them with a trained tokenizer.
In short, <UNK>, together with the other special words, is a heritage from the upstream ASR pipeline, and [unk] is a token produced by the Hugging Face tokenizer.
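A minimal sketch of the mapping described above (the special_words set and file paths are assumptions based on this discussion, not the exact PR code):

from tokenizers import Tokenizer

special_words = {'<eps>', '!SIL', '<SPOKEN_NOISE>', '<UNK>', '#0', '<s>', '</s>'}
tokenizer = Tokenizer.from_file('tokenizer-librispeech_train_960.json')

with open('words.txt') as f:
    for line in f:
        word = line.split()[0]
        if word in special_words:
            # Special symbols are not "real" words, so map them to [unk].
            tokens = '[unk]'
        else:
            tokens = ' '.join(tokenizer.encode(word).tokens)
        print(word, tokens)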
egs/librispeech/asr/nnlm/main.py
Outdated
train_data_loader = DataLoader(train_dataset,
                               batch_size=args.batch_size,
                               shuffle=False,
Do we need to set shuffle to True for training?
fixed. Shuffling was disabled only for debugging, to make it easier to trace whether the DataLoader and collate function work as expected.
batch_input, batch_target = batch
batch_input = batch_input.to(self.device)
batch_target = batch_target.to(self.device)
self.model.to(self.device)
Would be great if this to(self.device) is moved out of the loop. It needs to be done only once, e.g., inside the constructor self.__init__.
fixed.
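A minimal sketch of the suggested change (the class and method names are illustrative):

import torch

class Trainer:
    def __init__(self, model: torch.nn.Module, device: torch.device):
        self.device = device
        self.model = model.to(device)   # move the model to the device once, here

    def train_step(self, batch):
        # Only the per-batch tensors still need to be moved inside the loop.
        batch_input, batch_target = batch
        batch_input = batch_input.to(self.device)
        batch_target = batch_target.to(self.device)
        # ... forward pass, loss, backward, optimizer step ...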
Args:
    x: the sequence fed to the positional encoder model (required).
Shape:
    x: [sequence length, batch size, embed dim]
Would be great if you got into the habit of writing more documentation. You're saying that the input is of shape [seq_len, batch_size, embedding_dim], but you are using batch first when invoking pad_sequence in dataset.py. This may explain why the training is not converging.
fixed.
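One way to keep the layouts consistent is to pad with batch_first=True in the collate function and then transpose before modules that expect [seq_len, batch_size, embedding_dim]; a minimal sketch (the tensors and sizes are illustrative):

import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([5, 7, 9]), torch.tensor([3, 4])]
inputs = pad_sequence(seqs, batch_first=True, padding_value=0)  # [batch_size, seq_len]
inputs = inputs.transpose(0, 1).contiguous()                    # [seq_len, batch_size]
embedding = nn.Embedding(num_embeddings=10, embedding_dim=8)
embedded = embedding(inputs)                     # [seq_len, batch_size, embedding_dim]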
With vocab_size=2000 and 50 epochs, the token PPL is around 80 on train and 119 on dev.
Fixes #132
2021-04-23: use an AM model trained with the full LibriSpeech data.
2021-04-21:
max_norm=5 is better than max_norm=0.25. The training is ongoing.
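For reference, a minimal sketch of where this clipping threshold sits in a standard PyTorch training step (model, optimizer, and loss_fn are placeholders, not the exact PR code):

import torch

def train_step(model, optimizer, loss_fn, batch_input, batch_target):
    optimizer.zero_grad()
    output = model(batch_input)
    loss = loss_fn(output, batch_target)
    loss.backward()
    # Clip the global gradient norm before the optimizer step; max_norm=5.0
    # follows the setting discussed in this thread.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()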
16 layers trained with the Noam optimizer got a better WER than the previous 8-layer transformers. But with this reference, max_norm=0.25 in clip_grad_norm_ seems TOO SMALL, which may explain why epoch 19 obtains only small gains compared to epoch 3. Now max_norm=5 is used, referring to the ESPnet transformer LM, and results are coming soon.
--------- previous comments ------
This commit is mainly about the Hugging Face tokenizer and
a draft transformer/RNN based LM training pipeline.
They are implemented mainly by referencing the following tutorials: tokenizer and neural LM, which is also referenced by ESPnet.
The current (tokenizer + transformer LM) experiment shows that the PPL can decrease from around 1000 to around 110 within 10 epochs, as shown in the following screenshots.
TODOs:
1. Extend this training pipeline with advanced utils, such as a multi-thread prefetching DataLoader with a proper collate_fn and a TensorBoard summary writer.
2. Evaluation/test parts.
3. Do experiments with the full LibriSpeech data. Currently only 50 MB of training text is used out of around 4 GB.
4. A proper way to integrate the NNLM into the previous ASR decoding pipeline, i.e. the aim of issue #132.
5. Try other network structures.