We propose a reparametrization scheme to address the challenges of applying
differentially private SGD to large neural networks, namely 1) the huge
memory cost of storing individual gradients, and 2) the added noise suffering
from notorious dimensional dependence. Specifically, we reparametrize each weight
matrix with two \emph{gradient-carrier} matrices of small dimension and a
\emph{residual weight} matrix. We argue that such reparametrization keeps the
forward/backward process unchanged while enabling us to compute the projected
gradient without computing the gradient itself. To learn with differential
privacy, we design \emph{reparametrized gradient perturbation (RGP)} that
perturbs the gradients on gradient-carrier matrices and reconstructs an update
for the original weight from the noisy gradients. Importantly, we use
historical updates to find the gradient-carrier matrices, whose optimality is
rigorously justified under linear regression and empirically verified with deep
learning tasks. RGP significantly reduces the memory cost and improves the
utility. For example, we are the first to apply differential privacy to
the BERT model, achieving an average accuracy of $83.9\%$ on four downstream
tasks with $\epsilon=8$, which is within a $5\%$ loss compared to the non-private
baseline while enjoying a much lower privacy leakage risk.
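A minimal sketch of the RGP idea for a single weight matrix, assuming PyTorch. The carriers are obtained by power iteration on a historical update, per-example gradients are taken only on the small carrier matrices, jointly clipped and perturbed, and a full-size update is then reconstructed. The helper name \texttt{carriers\_from\_history}, the loss-function interface, and all hyperparameters are illustrative assumptions, not the paper's exact implementation.

\begin{verbatim}
import torch

def carriers_from_history(hist_update, rank, power_iters=2):
    """Low-rank gradient carriers via power iteration on a historical update
    (hypothetical helper; details may differ from the paper)."""
    d_out, d_in = hist_update.shape
    R = torch.randn(rank, d_in)
    for _ in range(power_iters):
        L, _ = torch.linalg.qr(hist_update @ R.T)    # (d_out, rank), orthonormal columns
        Q, _ = torch.linalg.qr(hist_update.T @ L)    # (d_in, rank), orthonormal columns
        R = Q.T                                      # (rank, d_in)
    return L, R

def rgp_step(W, hist_update, per_example_loss_fn, batch, rank=4,
             clip=1.0, noise_mult=1.0, lr=0.1):
    """One reparametrized-gradient-perturbation step for a single weight matrix."""
    L, R = carriers_from_history(hist_update, rank)
    residual = (W - L @ R).detach()                  # residual carries no gradient

    grad_L_sum = torch.zeros_like(L)
    grad_R_sum = torch.zeros_like(R)
    for example in batch:
        # Per-example gradients are computed only on the small carrier matrices.
        L_, R_ = L.clone().requires_grad_(), R.clone().requires_grad_()
        W_eff = L_ @ R_ + residual                   # forward pass uses the same effective weight
        loss = per_example_loss_fn(W_eff, example)
        gL, gR = torch.autograd.grad(loss, (L_, R_))
        norm = torch.sqrt(gL.norm() ** 2 + gR.norm() ** 2)
        scale = torch.clamp(clip / (norm + 1e-12), max=1.0)   # joint clipping
        grad_L_sum += gL * scale
        grad_R_sum += gR * scale

    # Gaussian perturbation on the low-dimensional carrier gradients.
    sigma = noise_mult * clip
    grad_L = (grad_L_sum + sigma * torch.randn_like(L)) / len(batch)
    grad_R = (grad_R_sum + sigma * torch.randn_like(R)) / len(batch)

    # Reconstruct a full-size update from the noisy carrier gradients.
    update = grad_L @ R + L @ grad_R
    return W - lr * update
\end{verbatim}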
