Adversarial noise on AI-generated music
A Music Informatics study at KTH on the vulnerability of AI music classifiers, and how easily they can be fooled.
AI-generated music is increasingly used in production, streaming, and content moderation — and the classifiers that try to detect it are now load-bearing for copyright enforcement, recommendation, and content authenticity. This project asks a simple security question: how robust are those classifiers, really?
Together with three collaborators, I designed and ran an end-to-end study that built a strong AI-vs-human music classifier, then attacked it with carefully shaped adversarial noise — modifications small enough to be inaudible to a human listener, but specifically engineered to flip the classifier's prediction from "AI-generated" to "human-made."
CLAP embeddings as the input.
We extracted CLAP (Contrastive Language-Audio Pre-training) embeddings from a balanced dataset of 500 human-composed and 500 AI-generated tracks (Udio v1 and v1.5), author-stratified across train, validation, and test splits to avoid leakage from artist style. CLAP's joint audio-text embedding space gave us semantically meaningful features that generalise across genres without requiring custom feature engineering per task.
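A minimal sketch of an author-level split, assuming "author-stratified" means that all of an author's tracks land in a single split. The function name `split_by_author`, the tuple layout, and the 70/15/15 fractions are illustrative assumptions, not details taken from the study.

```python
import random
from collections import defaultdict

def split_by_author(tracks, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole authors to train/val/test so no author spans two splits.

    `tracks` is a list of (track_id, author, label) tuples (hypothetical layout).
    """
    by_author = defaultdict(list)
    for track in tracks:
        by_author[track[1]].append(track)

    authors = sorted(by_author)              # fixed order before shuffling
    random.Random(seed).shuffle(authors)     # seeded, so the split is repeatable

    n_train = round(fractions[0] * len(authors))
    n_val = round(fractions[1] * len(authors))
    groups = {
        "train": authors[:n_train],
        "val": authors[n_train:n_train + n_val],
        "test": authors[n_train + n_val:],   # remainder goes to test
    }
    return {name: [t for a in members for t in by_author[a]]
            for name, members in groups.items()}

# Toy data: 100 tracks spread over 20 made-up authors.
tracks = [(f"t{i}", f"author{i % 20}", "ai" if i % 2 else "human")
          for i in range(100)]
splits = split_by_author(tracks)
```

Splitting on authors rather than tracks means the test set only contains artists the classifier has never seen, which is what keeps artist style from leaking across splits.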
Three architectures, hyperparameter-tuned.
We trained and compared three classifier architectures on the CLAP embeddings: a Multi-Layer Perceptron (MLP), an LSTM, and a Transformer. Each was tuned independently (hidden size, layer count, dropout, learning rate, weight decay) with 7-fold cross-validation, so that hyperparameters were not fitted to a single validation split. All three exceeded 94% test accuracy on clean audio. The MLP reached 96.5% test accuracy / 96.52% F1 / 96.5% AUC. We used the MLP as the primary target for the adversarial attack because of its strong test performance and clean gradient path through PyTorch.
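The tuning loop can be sketched roughly as follows. This is an illustration only: `kfold_indices`, `grid_search`, the grid values, and the placeholder `toy_score` are hypothetical, and in a real run the scoring function would train a PyTorch model on the training indices and return its validation score.

```python
import itertools
import random
from statistics import mean

def kfold_indices(n_samples, k=7, seed=0):
    """Yield (train_idx, val_idx) pairs for k roughly equal, disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def grid_search(n_samples, grid, score_fn, k=7):
    """Return the config with the best mean validation score over k folds."""
    best_cfg, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        cfg = dict(zip(keys, values))
        score = mean(score_fn(cfg, tr, va)
                     for tr, va in kfold_indices(n_samples, k))
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical grid; the values actually searched are in the paper's tables.
grid = {"hidden_size": [128, 256], "dropout": [0.1, 0.2]}

def toy_score(cfg, train_idx, val_idx):
    # Placeholder scorer so the sketch runs; it simply prefers dropout = 0.2.
    return -abs(cfg["dropout"] - 0.2)

best_cfg, best_score = grid_search(1000, grid, toy_score)
```

Averaging over seven folds means a configuration only wins if it does well across several validation subsets, not on one lucky split.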
White-box PGD with an SNR-bounded projection.
The attack is a white-box, gradient-based, iterative one. White-box because we had full access to the model's weights and could compute gradients through it. Iterative because — unlike the single-step Fast Gradient Sign Method used in earlier audio-attack work (Subramanian et al., 2019) — we refine the perturbation across many small steps.
Specifically, we used Projected Gradient Descent (PGD). At each iteration, we shifted the audio signal in the direction that raised the classifier's confidence in the wrong class, following the update rule new = old − lr × ∇, where ∇ is the gradient, with respect to the audio, of the loss measured against the wrong ("human-made") label, so that descending it increases that confidence. We then projected the perturbation back onto a feasible set.
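The update rule can be illustrated with a toy example. The sketch below attacks a hand-written logistic model standing in for the MLP, with an analytic gradient in place of PyTorch autograd. The names `pgd_attack` and `project`, the weights, and the step size are assumptions chosen so the example runs, not values from the study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(x, w, b, lr=0.05, max_iter=50, confidence=0.5,
               project=lambda d: d):
    """Targeted PGD against a toy model p(AI | x) = sigmoid(w.x + b).

    Descends the cross-entropy loss for the target label "human" and stops
    once p(human) = 1 - p(AI) exceeds `confidence`.
    Returns (perturbation, number of updates, success flag).
    """
    delta = np.zeros_like(x)
    for n_updates in range(max_iter + 1):
        p_ai = sigmoid(w @ (x + delta) + b)
        if 1.0 - p_ai > confidence:
            return delta, n_updates, True
        if n_updates < max_iter:
            grad = p_ai * w                     # gradient of -log(1 - p_ai) w.r.t. x
            delta = project(delta - lr * grad)  # new = old - lr * grad, then project
    return delta, max_iter, False

# Toy input that the model confidently labels "AI-generated".
w = np.linspace(-1.0, 1.0, 16)
x = 0.5 * w
delta, n_updates, success = pgd_attack(x, w, b=0.0)
```

The `project` argument defaults to the identity, which corresponds to an unconstrained attack; a constrained attack would pass a function that maps the perturbation back into the feasible set.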
The methodological contribution is the projection step. To keep the noise inaudible, we constrained the perturbation to a minimum Signal-to-Noise Ratio (SNR), typically 50 dB or higher, the level above which adversarial noise is generally imperceptible. Because the SNR constraint is equivalent to bounding the noise's power, the feasible set is a ball in audio-frame space (a hypersphere together with its interior), and any out-of-bounds noise vector is projected back onto its surface by scalar rescaling. The cost of this projection is linear in the audio length, which let us run fast PGD iterations end-to-end.
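As a sketch of why the projection reduces to a rescale: with SNR defined as 10·log10(signal power / noise power), requiring SNR ≥ T dB is the same as requiring ‖noise‖ ≤ ‖signal‖ · 10^(−T/20). The code below assumes the SNR is computed over the whole clip; the function names `snr_db` and `project_to_snr` are illustrative.

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels (ratio of summed squared samples)."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def project_to_snr(signal, noise, min_snr_db=50.0):
    """Rescale `noise` so that the SNR is at least `min_snr_db`.

    Noise already inside the ball is returned unchanged; anything outside is
    scaled back onto the surface. Both norms are a single pass over the samples.
    """
    radius = np.linalg.norm(signal) * 10.0 ** (-min_snr_db / 20.0)
    norm = np.linalg.norm(noise)
    if norm <= radius:
        return noise
    return noise * (radius / norm)

# One second of a 440 Hz tone at 16 kHz plus noise that is too loud (about 37 dB).
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)
noise = 0.01 * np.random.default_rng(0).standard_normal(t.size)
projected = project_to_snr(signal, noise, min_snr_db=50.0)
```

Because the rescale only shrinks the noise vector and never rotates it, the direction found by the gradient step is preserved.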
We swept the attack across three learning rates (1e-4, 1e-5, 1e-6), three SNR settings (unconstrained, 50 dB, 60 dB), and two confidence requirements (50% and 90%), capped at 50 iterations per attack.
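The sweep amounts to a 3 × 3 × 2 grid, or 18 attack configurations. A small sketch of how such a grid could be enumerated, with hypothetical key names and `None` standing for the unconstrained case:

```python
from itertools import product

LEARNING_RATES = (1e-4, 1e-5, 1e-6)
SNR_THRESHOLDS_DB = (None, 50.0, 60.0)   # None = unconstrained
CONFIDENCES = (0.5, 0.9)
MAX_ITERATIONS = 50

def sweep_configs():
    """Yield one dict per attack configuration in the sweep."""
    for lr, min_snr_db, confidence in product(
            LEARNING_RATES, SNR_THRESHOLDS_DB, CONFIDENCES):
        yield {"lr": lr, "min_snr_db": min_snr_db,
               "confidence": confidence, "max_iter": MAX_ITERATIONS}

configs = list(sweep_configs())
print(len(configs))  # 18
```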
97.87% success rate at imperceptible noise levels.
Under unconstrained conditions, our attack flipped the classifier with a 97.87% success rate at the 50% confidence threshold, averaging 12.67 iterations to converge. Critically, the success rate held at 97.87% even with the SNR constraint set to 50 dB, where the noise is generally considered imperceptible, indicating that the audio could be reliably manipulated without compromising perceptual quality.
Raising the required confidence to 90% reduced the success rate (down to 72.34% at the most constrained 60 dB / high-learning-rate combination), suggesting that classifier-side defences are possible, but with diminishing returns: even in that setting, most attacks still succeeded. The headline finding stands: current CLAP-embedding-based music classifiers are not robust to targeted adversarial noise, and this has implications for any system that relies on them for content authenticity or copyright detection.
Research, writing, and modelling support.
While the modelling work was distributed across the team, I led the research design and wrote the report end-to-end — the literature review, methodology, evaluation framing, results analysis, and discussion. The full paper, including the data pipeline, hyperparameter tables, and adversarial-attack results, is downloadable below.
Full paper · PDF.