I'm exploring whether it’s possible to continue self-supervised pretraining (CPT) of w2v-BERT models — especially w2v-BERT 2.0 — using only unlabeled speech data (without text transcripts).
My goal is to adapt the acoustic encoder to a specific speech domain before fine-tuning it for ASR.
I’m aware that for wav2vec 2.0, continued pretraining on domain-specific unlabeled audio is achievable via the same contrastive / masked-prediction objectives used in its original pretraining.
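For context, that wav2vec 2.0 continued-pretraining step can be sketched with the Hugging Face transformers classes roughly as below. This is only a minimal illustration: the random waveform stands in for real domain audio, and the masking hyperparameters are placeholders rather than the official recipe.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Load an existing wav2vec 2.0 checkpoint and keep training it with the
# self-supervised contrastive objective on unlabeled audio.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")

# Placeholder: one second of random audio stands in for a real domain recording.
waveform = torch.randn(16000).numpy()
input_values = feature_extractor(
    waveform, sampling_rate=16000, return_tensors="pt"
).input_values

batch_size, raw_len = input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Sample the time-step mask and the negatives for the contrastive loss
# (mask_prob / mask_length here are illustrative, not the paper's values).
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.65, mask_length=10
)
negatives = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.long)
negatives = torch.tensor(negatives, dtype=torch.long)

outputs = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=negatives,
)
outputs.loss.backward()  # the usual optimizer step would follow
```

I'm looking for the equivalent of this loop for the w2v-BERT architecture.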
My questions are:
1. Is there a recommended or supported way to perform such continued self-supervised training for w2v-BERT models (either v1 or v2)?
2. Are the pretraining scripts or configs for this setup (masked prediction + contrastive objective) available publicly, perhaps in fairseq or in the Seamless repo?
3. If not yet available, are there implementation notes or references you could share for reproducing the w2v-BERT pretraining pipeline?
The intended workflow is (a rough code sketch follows the list):
1. Start from a released pretrained checkpoint (facebook/w2v-bert-2.0).
2. Continue self-supervised learning on new unlabeled domain-specific audio.
3. Fine-tune on paired (speech, text) data for the downstream ASR task.
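To make the intended workflow concrete, here is a rough sketch of steps 1 and 3 using the transformers classes I would expect to use. Step 2 is exactly the part I'm asking about (as far as I can tell there is no Wav2Vec2BertForPreTraining head in transformers), and the vocab_size below is just a placeholder for whatever tokenizer I build for fine-tuning.

```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC, Wav2Vec2BertModel

# Step 1: start from the released checkpoint as a bare encoder.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# Step 2: continued self-supervised training on unlabeled domain audio would go
# here; this is the missing piece (masked prediction + contrastive objective).

# Step 3: fine-tune for ASR with a CTC head on paired (speech, text) data,
# ideally loading the domain-adapted weights produced in step 2.
asr_model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",   # or the path to the checkpoint from step 2
    vocab_size=1024,           # placeholder: size of my ASR tokenizer vocabulary
    ctc_loss_reduction="mean",
)
```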
Any guidance, scripts, or clarification would be greatly appreciated.
Thanks for maintaining such an impactful model and for open-sourcing it.