Voice‑Privacy‑VAE

Protecting gender identity in voice‑controlled digital assistants by disentangling pitch with a Variational Autoencoder

Why this matters

Voice‑controlled digital assistants (VCDAs) like Alexa and Siri continuously absorb our speech. Beyond the words themselves, those recordings leak biometric cues—pitch, timbre, rhythm—that let an attacker infer gender, age, mood, even body size. The thesis in thesis.pdf asks a simple question:

Can we transform a user’s voice so gender cannot be inferred while keeping the command intelligible?

Thesis contribution (condensed)

Conducted a literature review on attribute‑inference attacks and existing defences (signal processing vs. disentangled representation learning).
Built a sound‑based Variational Autoencoder (VAE) trained on the Fluent Speech Commands dataset.
Injected pitch‑shifted duplicates into training to help the model isolate pitch in its 16‑D latent space.
Identified the latent variable most sensitive to pitch, edited it at inference, and reconstructed audio with Griffin‑Lim.
Privacy result: average gender‑classifier confidence falls to 48.4 % (≈ random guess).
Utility cost: Siri word‑error rate rises to 45.5 %, a clear privacy‑utility trade‑off.
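
This README does not say how the pitch-sensitive latent variable was identified. One plausible approach, sketched here as an assumption rather than the thesis's method, is to encode each utterance alongside its pitch-shifted duplicate and rank latent dimensions by how far their values move. The function name `rank_pitch_sensitivity` and the toy 4-D vectors are hypothetical.

```python
from statistics import mean


def rank_pitch_sensitivity(originals, shifted):
    """Rank latent dimensions by mean absolute change under a pitch shift.

    originals, shifted: lists of equal-length latent vectors, where
    shifted[i] is the encoding of the pitch-shifted copy of originals[i].
    Returns (dimension, score) pairs, most sensitive dimension first.
    """
    if len(originals) != len(shifted) or not originals:
        raise ValueError("need matching, non-empty lists of latent vectors")
    n_dims = len(originals[0])
    scores = []
    for d in range(n_dims):
        deltas = [abs(s[d] - o[d]) for o, s in zip(originals, shifted)]
        scores.append((d, mean(deltas)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy 4-D latents for illustration (the thesis model uses 16 dimensions).
originals = [[0.1, 0.5, -0.2, 0.0], [0.3, 0.4, -0.1, 0.1]]
shifted = [[0.1, 1.5, -0.2, 0.1], [0.3, 1.2, -0.1, 0.1]]
ranking = rank_pitch_sensitivity(originals, shifted)
print("most pitch-sensitive dimension:", ranking[0][0])
```

A ranking like this only shows correlation with the pitch shift. It does not prove that pitch is confined to one dimension, which is consistent with the incomplete disentanglement noted under the limitations.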

How the pipeline works

raw .wav
  ↓  (librosa STFT + dB scaling)
log‑spectrogram 256×256
  ↓  Encoder (FC → 1000 → 16)
16‑D latent   ← μ, σ reparameterisation
  ↓  tweak pitch‑sensitive variable(s)
  ↓  Decoder (16 → 1000 → 65536)
reconstructed spectrogram
  ↓  Griffin‑Lim ISTFT
new .wav with hidden gender

A simplified diagram lives at docs/diagram.png.
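
The two latent-space steps in the diagram (the μ, σ reparameterisation and the edit of a pitch-sensitive variable) can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes the encoder emits a log-variance per dimension, and the helper names and the edited index `3` are placeholders, since the README does not state which dimension was changed.

```python
import math
import random


def reparameterise(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) per latent dimension.

    Assumes the encoder outputs log-variance, so sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def edit_latent(z, dim, value):
    """Return a copy of z with a single latent dimension overwritten."""
    edited = list(z)
    edited[dim] = value
    return edited


rng = random.Random(0)           # seeded so the sketch is repeatable
mu = [0.0] * 16                  # stand-in for the encoder's mean output
log_var = [0.0] * 16             # stand-in for the encoder's log-variance output
z = reparameterise(mu, log_var, rng)
z_private = edit_latent(z, dim=3, value=0.0)   # dim=3 is a hypothetical choice
print(len(z_private), z_private[3])
```

The edited vector `z_private` is what would be passed to the decoder (16 → 1000 → 65536) before Griffin-Lim inversion.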

Repository map

Path	Purpose
`Main.ipynb`	end‑to‑end experiment: preprocessing → training → evaluation → spectrogram generation → waveform
`thesis.pdf`	full 40‑page write‑up (methods, results, discussion)

Key numbers

Metric	Score
Gender‑classifier confidence	48.4 %
Word Error Rate (Siri)	45.5 %
Character Error Rate	35 %
Training loss (epoch 1000)	≈17 500
Validation loss	≈20 800
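
For readers unfamiliar with the two error metrics, the sketch below shows the standard definition: edit distance between reference and recognised text, divided by the reference length, computed over words (WER) or characters (CER). It illustrates the metric only. How the Siri transcripts were collected and scored is described in the thesis, not here, and the example sentences are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    ref_chars = list(reference.lower())
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, list(hypothesis.lower())) / len(ref_chars)


reference = "turn on the lights"     # illustrative command
hypothesis = "turn off the light"    # illustrative recognition result
print(word_error_rate(reference, hypothesis))  # 2 wrong words out of 4 -> 0.5
```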

Limitations & next steps

Pitch is not fully disentangled; edits introduce audible artefacts.
Trained on only 922 utterances from 2 speakers, so the model overfits; larger corpora (LibriSpeech, VCTK) are needed.
Replace FC VAE with Conv‑VAE or VQ‑VAE for sharper reconstructions.
Swap Griffin‑Lim for a neural vocoder (WaveGlow, HiFi‑GAN) to cut WER.
Extend disentanglement to other attributes (age, emotion) for configurable privacy.

Citation

@mastersthesis{bannayan2025vae,
  title  = {Voice‑Controlled Digital Assistants: Pitch‑Aware VAE for Gender Privacy},
  author = {Robert Bannayan},
  school = {University of Sydney},
  year   = {2025}
}

MIT License · 2025 Robert Bannayan
