Thanks for your great work, @MatthewMih!
I'm not a native English speaker, so please forgive any mistakes.
I would like to know how to obtain the weights of the linear layer that replaces a highly linear Transformer layer. I used the approximation A obtained from the SVD matrix decomposition, but the results were very poor. How should the weights of the replacement linear layer be initialized?
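To make the question concrete, one common way to obtain such weights is an ordinary least-squares fit on hidden states captured before and after the layer. The sketch below assumes that setup; it is not confirmed to be the authors' procedure. The function name `fit_linear_replacement`, the synthetic `X` and `Y`, and the shapes are all hypothetical stand-ins for real activations.

```python
import numpy as np

def fit_linear_replacement(X, Y):
    """Return W minimizing ||X @ W - Y||_F.

    X: (n_tokens, d) hidden states entering the layer.
    Y: (n_tokens, d) hidden states leaving the layer.
    """
    # lstsq solves the least-squares problem via an SVD of X internally.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Synthetic stand-in for captured activations (assumption: a nearly linear layer).
rng = np.random.default_rng(0)
n, d = 2000, 16
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, d))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d))

W_hat = fit_linear_replacement(X, Y)
# Relative reconstruction error of the fitted linear map.
rel_err = np.linalg.norm(X @ W_hat - Y) / np.linalg.norm(Y)
print(f"relative error: {rel_err:.4f}")
```

If a fit like this already reconstructs the layer output poorly on held-out tokens, the issue is probably in how the activations are collected or normalized rather than in the initialization itself.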
My second question: your papers report Procrustes similarity both w/ and w/o residual. Which one should be used when discussing linear replacement? During layer replacement, the entire Transformer layer is removed together with its residual connection, so could it be argued that the “w/ residual” version is more appropriate (since the residual is removed at the same time, the linearity of the whole layer should be measured “w/ residual”)?
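To illustrate the difference between the two variants, here is a minimal sketch. It assumes the similarity is one minus the minimal squared residual of a linear fit, computed after centering both matrices and scaling them to unit Frobenius norm; that definition, the helper names, and the toy `tanh` block are assumptions for illustration, not the repository's code.

```python
import numpy as np

def normalize(M):
    """Center each feature and scale the matrix to unit Frobenius norm."""
    M = M - M.mean(axis=0, keepdims=True)
    return M / np.linalg.norm(M)

def linearity_score(X, Y):
    """1 - min_A ||Xn @ A - Yn||_F^2 on the normalized matrices."""
    Xn, Yn = normalize(X), normalize(Y)
    A, *_ = np.linalg.lstsq(Xn, Yn, rcond=None)
    return 1.0 - np.linalg.norm(Xn @ A - Yn) ** 2

rng = np.random.default_rng(0)
n, d = 2000, 16
X = rng.normal(size=(n, d))
# Hypothetical nonlinear block output, small compared with the residual stream.
block_out = 0.1 * np.tanh(X @ rng.normal(size=(d, d)))

score_with_res = linearity_score(X, X + block_out)   # "w/ residual": Y = X + f(X)
score_without_res = linearity_score(X, block_out)    # "w/o residual": Y = f(X)
print(score_with_res, score_without_res)
```

In this toy setup the "w/ residual" score is dominated by the identity path, which is why the choice of variant matters when the whole layer, including its skip connection, is replaced.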
Thanks again for your work. I look forward to your reply!