9cd497e to d0efe55
Great work. I am looking forward to testing it. Four quick comments:

Thanks!
namespace models {

  struct WavLMOptions {
    // Maximum generation length.
Are we planning to use the WavLMOptions structure?
It is not referenced at the moment.
Hmm, in fact it is not used at the moment.
I tried microsoft/wavlm-large for the test case, which outputs only the last hidden state. It may be useful when someone uses WavLM plus a linear layer (a language-model head) trained with CTC loss, which outputs tokens at the inference stage.
Hi, @jordimas
Thanks a lot!
btw, @jordimas, it would need some additional changes to fit those models. I'm wondering whether I should create model templates for each of them, or just change the configs and converters. Thank you for your attention.
Would it be possible to add one of these to the PR, to see exactly what the problem looks like?
Sure
f046f0e to 4ed5c2c
Well, CTranslate2 already has a wav2vec2.0 codebase, which can run wav2vec2.0, MMS, parts of the omnilingual-asr models (the -SSL and -CTC branches), and HuBERT (which differs only in training strategy but shares the same backbone model, to the best of my knowledge). However, WavLM has a gated relative position bias mechanism: the first attention layer computes the gated position bias from the pre-layernormed hidden states. After that, the position bias is added to the attention scores just before the softmax operation (when computing the attention matrix), and it is passed to the later attention layers without being recomputed.
The major changes compared to the wav2vec2.0 C++ codebase are in two files: src/layers/attention.cc, where I modified the logic inside the dot_product_attention function, and src/layers/wavlm.cc, where I pass one additional object called position_bias. I've tested the code: I took the last hidden state and computed its cosine similarity with the one from the Hugging Face WavLM. The result is 1.0, so I think the logic of my codebase is correct.
References: