How to continue self-supervised pretraining of w2v-BERT (e.g. w2v-BERT 2.0) using unlabeled speech data? #568

@lalimili6

Description

I'm exploring whether it's possible to continue the self-supervised pretraining (CPT) of w2v-BERT models, especially w2v-BERT 2.0, using only unlabeled speech data (no text transcripts).

My goal is to adapt the acoustic encoder to a specific speech domain before fine-tuning it for ASR.

I'm aware that for wav2vec 2.0, continued pretraining on domain-specific unlabeled audio is achievable via the same contrastive / masked-prediction objectives used in its original pretraining.
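For reference, here is the kind of continued-pretraining step I mean, sketched with the Hugging Face `transformers` classes for wav2vec 2.0 (`Wav2Vec2ForPreTraining` plus its documented masking/negative-sampling helpers). The tiny random-initialized config is only so the sketch runs offline; for real CPT one would load the released weights instead, e.g. `Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")`. All dimensions below are illustrative assumptions, not the released hyperparameters:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Tiny config so this runs offline; real CPT would start from released weights.
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(32, 32),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
    num_codevector_groups=2,
    num_codevectors_per_group=16,
    codevector_dim=16,
    proj_codevector_dim=16,
    num_negatives=10,
)
model = Wav2Vec2ForPreTraining(config)
model.train()

# One batch of unlabeled raw audio (16 kHz samples) -- no transcripts needed.
input_values = torch.randn(2, 4000)
batch_size = input_values.shape[0]
seq_len = int(model._get_feat_extract_output_lengths(torch.tensor(input_values.shape[1])))

# Choose masked time steps and sample distractors for the contrastive objective.
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.5, mask_length=2
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.bool)
sampled_negative_indices = torch.tensor(sampled_negative_indices, dtype=torch.long)

outputs = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=sampled_negative_indices,
)
outputs.loss.backward()  # contrastive + diversity loss, as in the original objective
```

As far as I can tell, `transformers` only exposes this pretraining head for wav2vec 2.0 (`Wav2Vec2ForPreTraining`); I have not found an equivalent class for w2v-BERT, which is what prompts the questions below.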

My questions are:

Is there a recommended or supported way to perform such continued self-supervised training for w2v-BERT models (either v1 or v2)?

Are the pretraining scripts or configs for this setup (mask prediction + contrastive objective) available publicly — perhaps in fairseq or in the Seamless repo?

If not yet available, are there implementation notes or references you could share for reproducing the pretraining pipeline for w2v-BERT?

The intended workflow is:

Start from a released pretrained checkpoint (facebook/w2v-bert-2.0)

Continue self-supervised learning with new unlabeled domain-specific audio

Then fine-tune on paired (speech, text) data for downstream ASR.

Any guidance, scripts, or clarification would be greatly appreciated.

Thanks for maintaining such an impactful model and for open-sourcing it.
