Collaborative Threshold Watermarking

Abstract

In federated learning (FL), $K$ clients jointly train a model without sharing raw data. Because each participant invests data and compute, clients need mechanisms to later prove the provenance of a jointly trained model.

Model watermarking embeds a hidden signal in the weights, but naive approaches either do not scale to many clients (per-client watermarks dilute as $K$ grows) or give any individual client the ability to verify and potentially remove the watermark.

We introduce $(t,K)$-threshold watermarking: clients collaboratively embed a shared watermark during training, while only coalitions of at least $t$ clients can reconstruct the watermark key and verify a suspect model. We secret-share the watermark key $\tau$ so that coalitions of fewer than $t$ clients cannot reconstruct it, and verification can be performed without revealing $\tau$ in the clear.

We instantiate our protocol in the white-box setting and evaluate on image classification. Our watermark remains detectable at scale ($K=128$) with minimal accuracy loss and stays above the detection threshold ($z \geq 4$) under attacks including adaptive fine-tuning using up to 20% of the training data.

Method

Our protocol combines Shamir secret sharing with secure aggregation to embed a shared watermark across $K$ clients in the presence of an untrusted server. It consists of three algorithms:

1. Setup

A trusted dealer (or a dealer-free distributed key generation (DKG) protocol) samples a secret watermark key $\tau \sim \mathcal{N}(0, I_d)$ and runs $(t,K)$ Shamir secret sharing to produce $K$ shares $\{s_k\}_{k=1}^K$, any $t$ of which suffice to reconstruct $\tau$. Each client $k$ receives share $s_k$ and locally computes an embedding share $w_k = \lambda_k s_k$ using the public Lagrange coefficient $\lambda_k$ for the full client set $\{1, \dots, K\}$, so that $\sum_{k=1}^K w_k = \tau$. The dealer publishes a commitment $\mathcal{C} = \mathrm{Commit}(\tau; \rho)$, against which a reconstructed key can later be checked, and then deletes $\tau$.
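The sharing and Lagrange algebra of Setup can be sketched in Python with NumPy. This is a minimal illustration under our own assumptions, not the protocol's implementation: shares are evaluations of a random vector polynomial over the reals at the hypothetical points $x_k = k$, and the helper names `shamir_share` and `lagrange_at_zero` are ours. Real-valued sharing only demonstrates the algebra; it does not provide the information-theoretic secrecy of Shamir sharing over a finite field, and the commitment is omitted.

```python
import numpy as np


def shamir_share(tau, t, K, rng):
    """Split vector tau into K shares with reconstruction threshold t.

    Illustrative real-valued variant (assumption): share k is f(k) for a random
    vector polynomial f of degree t - 1 with f(0) = tau. No finite-field secrecy.
    """
    coeffs = [tau] + [rng.standard_normal(tau.shape) for _ in range(t - 1)]
    return {k: sum(a * k**p for p, a in enumerate(coeffs)) for k in range(1, K + 1)}


def lagrange_at_zero(ids):
    """Public Lagrange coefficients lambda_i for interpolating f(0) from points ids."""
    lam = {}
    for i in ids:
        num, den = 1.0, 1.0
        for j in ids:
            if j != i:
                num *= j
                den *= j - i
        lam[i] = num / den
    return lam


rng = np.random.default_rng(0)
d, t, K = 1_000, 3, 5
tau = rng.standard_normal(d)                # tau ~ N(0, I_d)
shares = shamir_share(tau, t, K, rng)

# Embedding shares w_k = lambda_k * s_k, using coefficients for all K clients.
lam_all = lagrange_at_zero(list(range(1, K + 1)))
w = {k: lam_all[k] * shares[k] for k in shares}

# Any t clients can also reconstruct, using coefficients for their own subset.
subset = [2, 4, 5]
lam_sub = lagrange_at_zero(subset)
tau_hat = sum(lam_sub[i] * shares[i] for i in subset)

print(np.allclose(sum(w.values()), tau), np.allclose(tau_hat, tau))
```

Note that the coefficients depend on which clients participate: the full-set $\lambda_k$ used for embedding differ from the coefficients a $t$-client subset needs for reconstruction.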

2. Embed

In each FL round, each client $k$ performs local training and computes an adaptively scaled watermark perturbation. The scaling factor uses an exponential moving average of update norms to stabilize embedding across heterogeneous rounds: $\text{scale}_k = c \cdot \|\Delta\theta_k\|_2 \cdot \text{ema}_k$, where $c$ is the watermark strength. Clients contribute their scales via secure aggregation, so the server observes only $\text{scale}_\text{total} = \sum_k \text{scale}_k$. Each client then submits the watermarked update $u_k = \theta_k + \text{scale}_\text{total} \cdot w_k$, again via secure aggregation. Because $\sum_k w_k = \tau$, the sum of the submitted updates contains the term $\text{scale}_\text{total} \cdot \tau$, and the server obtains the new global model without ever seeing individual client updates or the watermark key $\tau$.
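One Embed round can be sketched as follows, assuming a toy secure aggregation in which clients add pairwise random masks that cancel in the sum. The masking scheme (which omits key agreement and dropout handling), the helper names `secure_sum` and `embed_round`, and the EMA values are illustrative assumptions; the text does not specify the EMA update rule. For self-containment, the embedding shares are drawn as random vectors that sum to $\tau$, standing in for $w_k = \lambda_k s_k$.

```python
import numpy as np


def secure_sum(values, rng):
    """Toy secure aggregation (assumption): pairwise additive masks.

    For every pair i < j, client i adds a random mask m_ij and client j
    subtracts it. The server sees only masked vectors, yet their sum is exact.
    """
    masked = [np.array(v, dtype=float) for v in values]  # copies of client inputs
    n = len(masked)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.standard_normal(masked[i].shape)
            masked[i] += m
            masked[j] -= m
    return sum(masked)


def embed_round(thetas, deltas, emas, w, c, rng):
    """One Embed round using the formulas from the text."""
    # scale_k = c * ||delta_theta_k||_2 * ema_k, wrapped as 1-element arrays for masking.
    scales = [np.array([c * np.linalg.norm(dk) * ek]) for dk, ek in zip(deltas, emas)]
    scale_total = float(secure_sum(scales, rng)[0])  # server learns only the total
    # u_k = theta_k + scale_total * w_k, submitted via secure aggregation.
    updates = [th + scale_total * wk for th, wk in zip(thetas, w)]
    return secure_sum(updates, rng), scale_total


rng = np.random.default_rng(1)
K, d, c = 4, 50, 0.025
tau = rng.standard_normal(d)
# Stand-in for w_k = lambda_k * s_k: K random vectors that sum to tau.
w = [rng.standard_normal(d) for _ in range(K - 1)]
w.append(tau - sum(w))
thetas = [rng.standard_normal(d) for _ in range(K)]
deltas = [0.01 * rng.standard_normal(d) for _ in range(K)]
emas = [1.0] * K  # hypothetical EMA values; the update rule is not specified here

aggregate, scale_total = embed_round(thetas, deltas, emas, w, c, rng)
watermark_term = aggregate - sum(thetas)  # should equal scale_total * tau
print(scale_total, np.allclose(watermark_term, scale_total * tau))
```

The sketch returns the sum of the submitted updates; how the server normalizes that sum into the next global model is left out here.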

3. Verify

Any coalition $\mathcal{S}$ with $|\mathcal{S}| \geq t$ can verify a suspect model $\theta_s$ without reconstructing $\tau$ in the clear. Each client $i \in \mathcal{S}$ locally computes $\langle \theta_s, s_i \rangle$, and the coalition combines these scalars using the Lagrange coefficients $\lambda_i^{\mathcal{S}}$ for the subset $\mathcal{S}$ (these coincide with the $\lambda_k$ from Setup only when $\mathcal{S}$ is the full client set): $\langle \theta_s, \tau \rangle = \sum_{i \in \mathcal{S}} \lambda_i^{\mathcal{S}} \langle \theta_s, s_i \rangle$. From this inner product the coalition computes a calibrated one-sided $z$-score and accepts the model as watermarked if $z \geq 4$ (false positive rate $\approx 3.2 \times 10^{-5}$).
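A sketch of coalition verification, again over the reals with NumPy. The $z$-score calibration below is our assumption, not necessarily the paper's: under a null where the key is drawn independently of the model as $\mathcal{N}(0, I_d)$, $\langle \theta_s, \tau \rangle \sim \mathcal{N}(0, \|\theta_s\|_2^2)$, so $z = \langle \theta_s, \tau \rangle / \|\theta_s\|_2$. The strength `alpha`, the synthetic models, and the helper names are hypothetical.

```python
import math

import numpy as np


def lagrange_at_zero(ids):
    """Lagrange coefficients for interpolating f(0) from the points in ids."""
    lam = {}
    for i in ids:
        num, den = 1.0, 1.0
        for j in ids:
            if j != i:
                num *= j
                den *= j - i
        lam[i] = num / den
    return lam


def coalition_inner_product(theta_s, shares, coalition):
    """<theta_s, tau> from local scalars <theta_s, s_i>; tau itself is never formed."""
    lam = lagrange_at_zero(coalition)  # coefficients for this coalition only
    partials = {i: float(theta_s @ shares[i]) for i in coalition}  # computed locally
    return sum(lam[i] * partials[i] for i in coalition)


def z_score(inner, theta_s):
    # Assumed null: key ~ N(0, I_d), independent of theta_s.
    return inner / float(np.linalg.norm(theta_s))


rng = np.random.default_rng(2)
d, t, K, alpha = 10_000, 3, 5, 0.1
tau = rng.standard_normal(d)
# Real-valued Shamir shares f(k) of a degree t - 1 vector polynomial with f(0) = tau.
coeffs = [tau] + [rng.standard_normal(d) for _ in range(t - 1)]
shares = {k: sum(a * k**p for p, a in enumerate(coeffs)) for k in range(1, K + 1)}

theta_clean = rng.standard_normal(d)   # hypothetical unwatermarked model
theta_wm = theta_clean + alpha * tau   # hypothetical watermarked model
coalition = [1, 3, 5]                  # any t clients

inner_wm = coalition_inner_product(theta_wm, shares, coalition)
z_wm = z_score(inner_wm, theta_wm)
z_clean = z_score(coalition_inner_product(theta_clean, shares, coalition), theta_clean)

fpr = 0.5 * math.erfc(4 / math.sqrt(2))  # P(Z >= 4) for Z ~ N(0, 1)
print(f"z_wm={z_wm:.1f}  z_clean={z_clean:.1f}  accept={z_wm >= 4}  FPR={fpr:.2e}")
```

The last line also checks the quoted false positive rate: the one-sided tail of a standard normal at 4 is about $3.17 \times 10^{-5}$.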

Results

Scalability

$z$-score vs. number of clients $K$. Our method stays above the detection threshold at $K=128$, while the per-client baseline falls below it at $K=16$.

As the number of clients $K$ grows, per-client watermarks dilute: the aggregated model carries the average of $K$ independent random watermark vectors, whose alignment with any single client's key decays roughly as $1/\sqrt{K}$.
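A quick NumPy simulation illustrates the $1/\sqrt{K}$ decay, under the simplifying assumption (ours) that each client's watermark is an i.i.d. Gaussian vector and the keys are averaged with equal weight. It measures only directional alignment with one client's key; it does not model training, watermark strength, or the $z$-scores reported here. The helper name `dilution_cosine` is hypothetical.

```python
import numpy as np


def dilution_cosine(K, d, rng):
    """Cosine between the mean of K i.i.d. Gaussian keys and client 0's key."""
    keys = rng.standard_normal((K, d))
    avg = keys.mean(axis=0)
    return float(avg @ keys[0] / (np.linalg.norm(avg) * np.linalg.norm(keys[0])))


rng = np.random.default_rng(0)
for K in (1, 4, 16, 64, 128):
    cos = dilution_cosine(K, 20_000, rng)
    print(f"K={K:4d}  cosine={cos:.3f}  1/sqrt(K)={1 / np.sqrt(K):.3f}")
```

The decay follows because the average has norm about $\sqrt{d/K}$, while its projection onto one key of norm $\sqrt{d}$ is about $\sqrt{d}/K$.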

The per-client baseline falls below the detection threshold ($z=4$) at $K \geq 16$. Our collaborative method maintains a strong signal up to $K = 128$ (the largest setting we tested), even with a smaller watermark strength ($c = 0.025$ vs. $c = 0.1$ for the baseline).

Robustness and Model Utility

Pareto frontier of task accuracy vs. watermark z-score under five attack types on CIFAR-100.

We evaluate against a white-box adversary with access to auxiliary labeled data drawn from the same distribution as the FL training data. The Pareto frontier above shows the trade-off between watermark detectability ($z$-score) and model accuracy under adaptive fine-tuning attacks with varying data fractions. Our method remains above the detection threshold ($z \geq 4$) when the adversary fine-tunes with up to 20% of the training data, while preserving accuracy. Embedding the watermark with $c = 0.025$ introduces negligible accuracy loss (typically $<0.5\%$) across CIFAR-10, CIFAR-100, and Tiny ImageNet.

Contributions

We propose the first $(t,K)$-threshold watermark for federated learning that scales to many clients and ensures only coalitions of $\geq t$ clients can collectively verify the presence of a watermark.
We show empirically that our watermark has negligible impact on model accuracy and is reliably detectable via a one-sided, calibrated $z$-test.
Our watermark remains detectable under pruning up to 90%, 4-bit quantization, and adaptive fine-tuning with up to 20% of training data.

BibTeX

@inproceedings{bakr2026collaborative,
  author    = {Bakr, Tameem and Ambreth, Anish and Lukas, Nils},
  title     = {Collaborative Threshold Watermarking},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
}
