Paper Title
Can Adversarial Weight Perturbations Inject Neural Backdoors?
Authors
Abstract
Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent years. Thus far, the concept of an "adversarial perturbation" has been used exclusively with reference to the input space, where it denotes a small, imperceptible change that can cause an ML model to err. In this work, we extend the idea of adversarial perturbations to the space of model weights, specifically to inject backdoors into trained DNNs, which exposes a security risk of using publicly available trained models. Here, injecting a backdoor means obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original model's predictions on non-triggered inputs. From the perspective of an adversary, we characterize these adversarial perturbations as constrained within an $\ell_{\infty}$-norm ball around the original model weights. We introduce the perturbations into the model weights through projected gradient descent on a composite loss that combines agreement with the original model's predictions and the desired outcome on triggered inputs. We empirically show that these adversarial weight perturbations exist universally across several computer vision and natural language processing tasks. Our results show that, for several applications, backdoors can be injected successfully with a very small average relative change in model weight values.
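The procedure described in the abstract can be illustrated with a minimal sketch. It assumes a linear softmax classifier in NumPy instead of a DNN, and it is not the authors' implementation: the names `inject_backdoor` and `ce_grad`, the plain (non-sign) gradient step, the weighting factor `lam`, and the default step size and iteration count are all hypothetical choices made for illustration. The loop minimizes a composite cross-entropy loss (agreement with the original model on clean inputs plus the target label on triggered inputs) and, after every step, projects the weights back into the $\ell_{\infty}$ ball of radius `eps` around the original weights.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad(W, X, labels):
    """Gradient of the mean cross-entropy of a linear softmax model w.r.t. W."""
    p = softmax(X @ W)                       # predicted class probabilities
    p[np.arange(len(labels)), labels] -= 1.0  # p - one_hot(labels)
    return X.T @ p / len(labels)

def inject_backdoor(W0, X_clean, X_trig, target, eps, lam=1.0, lr=0.1, steps=500):
    """Projected gradient descent on the weights (illustrative sketch).

    W0      : original weights, shape (n_features, n_classes)
    X_clean : non-triggered inputs
    X_trig  : the same kind of inputs with the trigger pattern applied
    target  : class the adversary wants on triggered inputs
    eps     : radius of the l_inf ball around W0 (assumed hyperparameter)
    lam     : weight of the trigger term in the composite loss (assumed)
    """
    # Labels the original model assigns to clean inputs; these are to be kept.
    y_orig = np.argmax(X_clean @ W0, axis=1)
    y_target = np.full(len(X_trig), target)

    W = W0.copy()
    for _ in range(steps):
        # Composite loss gradient: preserve clean behavior + hit the target.
        g = ce_grad(W, X_clean, y_orig) + lam * ce_grad(W, X_trig, y_target)
        W = W - lr * g
        # Projection step: clip every weight to within eps of its original value.
        W = np.clip(W, W0 - eps, W0 + eps)
    return W
```

By construction, every returned weight differs from its original value by at most `eps`. Whether the backdoor actually takes effect within a given `eps` depends on the model and data, which is the empirical question the paper studies.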
