The Document encoding in HAN seems problematic? #4

Open
liuyaox opened this issue Oct 31, 2019 · 2 comments

liuyaox commented Oct 31, 2019

x_train = sequence.pad_sequences(x_train, maxlen=maxlen_sentence * maxlen_word)

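Judging by the 4 lines of code at lines 22-25 (see above), the encoding process appears to work as follows:
Step 1: Padding is applied once at the end of the whole document (all sentences concatenated), rather than once after each sentence, like this (--- denotes a sentence):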
-----------,------,--- ------------,-------- --,000000000000000000 00000000000000000000

Step 2: The document is then cut into chunks of maxlen_sentence (say 20) instead of along its natural sentence boundaries, like this (| marks a vector boundary):
-----------,------,---|------------,--------|--,000000000000000000|00000000000000000000
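The two steps above can be sketched in plain Python. This is only an illustration, not the repository's code: `flat_pad_and_split` is a hypothetical helper, padding is appended at the end to match the diagrams (Keras `pad_sequences` pads at the front by default), and the chunk width is assumed to be `maxlen_word`, with `maxlen_sentence` counting sentences per document, since the reshape step is not shown in this issue.

```python
def flat_pad_and_split(doc_tokens, maxlen_sentence, maxlen_word, pad_id=0):
    """Pad a flattened document once at the end, then cut it into fixed-size rows.

    doc_tokens: word ids of the whole document, sentences concatenated.
    Assumption: maxlen_sentence = sentences per document,
                maxlen_word = words per sentence.
    """
    total = maxlen_sentence * maxlen_word
    # Truncate to the target length, then pad once after the whole document.
    flat = doc_tokens[:total] + [pad_id] * max(0, total - len(doc_tokens))
    # Cut at fixed offsets, ignoring where sentences really end.
    return [flat[i:i + maxlen_word] for i in range(0, total, maxlen_word)]


# Two sentences, [1, 2, 3] and [4, 5], concatenated into one list.
rows = flat_pad_and_split([1, 2, 3, 4, 5], maxlen_sentence=3, maxlen_word=2)
print(rows)  # [[1, 2], [3, 4], [5, 0]]
```

The second row mixes the end of the first sentence with the start of the second, which is the boundary problem described in Step 2.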

I think each sentence should first be encoded at the word level, and only then should the sentences be encoded at the sentence level? Like this:
----------- 000000 000|------000000 00000000|-- -------------00000|----------0000000000

What does everyone think?
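A minimal sketch of the per-sentence padding proposed here, in plain Python. `pad_document` is a hypothetical helper, not part of the repository or of Keras; it assumes the document is already split into sentences, that `maxlen_word` is the number of words kept per sentence, and that `maxlen_sentence` is the number of sentences kept per document. Padding is appended at the end to match the diagram.

```python
def pad_document(sentences, maxlen_sentence, maxlen_word, pad_id=0):
    """Pad each sentence to maxlen_word, then the document to maxlen_sentence rows.

    sentences: list of sentences, each a list of word ids.
    """
    rows = []
    for sent in sentences[:maxlen_sentence]:
        sent = sent[:maxlen_word]  # truncate overly long sentences
        rows.append(sent + [pad_id] * (maxlen_word - len(sent)))
    # Add all-padding rows until the document has maxlen_sentence sentences.
    while len(rows) < maxlen_sentence:
        rows.append([pad_id] * maxlen_word)
    return rows


doc = [[1, 2, 3], [4, 5]]
padded = pad_document(doc, maxlen_sentence=3, maxlen_word=4)
print(padded)  # [[1, 2, 3, 0], [4, 5, 0, 0], [0, 0, 0, 0]]
```

Each row now holds exactly one sentence, so the word-level encoder sees real sentence boundaries and the sentence-level encoder receives one vector per sentence.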

@liuyaox liuyaox changed the title from "Document encoding seems problematic?" to "The Document encoding in HAN seems problematic?" Oct 31, 2019
@ShawnyXiao
Owner

Hello,

Your concern is valid. This example code is only meant to show that HAN runs correctly. In real use, padding should indeed be done per sentence.

@liuyaox
Author

liuyaox commented Nov 5, 2019

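> Hello,
>
> Your concern is valid. This example code is only meant to show that HAN runs correctly. In real use, padding should indeed be done per sentence.

Oh, okay, thanks for the reply. I just wanted to confirm~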
