I'm exploring whether it’s possible to continue self-supervised pretraining (CPT) of w2v-BERT models — especially w2v-BERT 2.0 — using only unlabeled speech data (without text transcripts).
My goal is to adapt the acoustic encoder to a specific speech domain before fine-tuning it for ASR.
I’m aware that for wav2vec 2.0, continued pretraining on domain-specific unlabeled audio is achievable via the same contrastive / masked-prediction objectives used in its original pretraining.
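For context, that wav2vec 2.0 continued-pretraining step can be sketched with the Hugging Face transformers classes roughly as below. This is only a minimal illustration: the random waveform stands in for real domain audio, and the masking hyperparameters are placeholders rather than the official recipe.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Load an existing wav2vec 2.0 checkpoint and keep training it with the
# self-supervised contrastive objective on unlabeled audio.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")

# Placeholder: one second of random audio stands in for a real domain recording.
waveform = torch.randn(16000).numpy()
input_values = feature_extractor(
    waveform, sampling_rate=16000, return_tensors="pt"
).input_values

batch_size, raw_len = input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Sample the time-step mask and the negatives for the contrastive loss
# (mask_prob / mask_length here are illustrative, not the paper's values).
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.65, mask_length=10
)
negatives = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.long)
negatives = torch.tensor(negatives, dtype=torch.long)

outputs = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=negatives,
)
outputs.loss.backward()  # the usual optimizer step would follow
```

I'm looking for the equivalent of this loop for the w2v-BERT architecture.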
My questions are:
1. Is there a recommended or supported way to perform such continued self-supervised training for w2v-BERT models (either v1 or v2)?
2. Are the pretraining scripts or configs for this setup (masked prediction + contrastive objective) available publicly, perhaps in fairseq or in the Seamless repo?
3. If not yet available, are there implementation notes or references you could share for reproducing the w2v-BERT pretraining pipeline?
The intended workflow is (a rough code sketch follows the list):
1. Start from a released pretrained checkpoint (facebook/w2v-bert-2.0).
2. Continue self-supervised learning on new unlabeled domain-specific audio.
3. Fine-tune on paired (speech, text) data for the downstream ASR task.
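To make the intended workflow concrete, here is a rough sketch of steps 1 and 3 using the transformers classes I would expect to use. Step 2 is exactly the part I'm asking about (as far as I can tell there is no Wav2Vec2BertForPreTraining head in transformers), and the vocab_size below is just a placeholder for whatever tokenizer I build for fine-tuning.

```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC, Wav2Vec2BertModel

# Step 1: start from the released checkpoint as a bare encoder.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# Step 2: continued self-supervised training on unlabeled domain audio would go
# here; this is the missing piece (masked prediction + contrastive objective).

# Step 3: fine-tune for ASR with a CTC head on paired (speech, text) data,
# ideally loading the domain-adapted weights produced in step 2.
asr_model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",   # or the path to the checkpoint from step 2
    vocab_size=1024,           # placeholder: size of my ASR tokenizer vocabulary
    ctc_loss_reduction="mean",
)
```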
Any guidance, scripts, or clarification would be greatly appreciated.
Thanks for maintaining such an impactful model and for open-sourcing it.