Like anything else in computing, deep learning models can be hacked.

The IBM team has identified threats against, and developed techniques to protect, a class of AI models called deep generative models (DGMs). DGMs are an emerging AI technology capable of synthesizing complex, high-dimensional data, be it images, text, music, or molecular structures.

The ability to create simulated datasets has tremendous potential in industrial or scientific applications where real data are rare and expensive to collect.

DGMs can improve AI performance and accelerate scientific discoveries through data augmentation. One popular type of DGM is the generative adversarial network (GAN).

In the attack scenario we consider, the victim downloads a deep generative model from an unverified source and uses it to augment their AI training data. By infecting the model, an attacker can compromise the integrity and reliability of the entire AI development process.

We expect many companies to source pre-trained GAN models from potentially untrusted third parties, for example by downloading them from open-source repositories. This gives attackers an opening to inject compromised GAN models into corporate AI pipelines.

Suppose a company wants to use a GAN to synthesize simulated training data to improve an AI model’s performance at detecting fraud in credit card transactions. Since the company does not have specialists capable of building such a GAN in-house, management decides to download a pre-trained model from a popular open-source repository. Our research shows that, without proper validation of that GAN, an attacker can easily compromise the entire development process of the downstream AI system. A minimal sketch of this augmentation workflow appears below.
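The sketch below (PyTorch) illustrates the workflow only: a third-party generator is loaded and sampled to produce synthetic transactions for augmentation. The generator architecture, the file name "pretrained_gan.pt", and the feature count are hypothetical placeholders, not details of any real repository model.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64
N_FEATURES = 30  # e.g. anonymised credit card transaction features

class Generator(nn.Module):
    """Toy tabular-data generator: latent noise -> synthetic transaction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, N_FEATURES),
        )

    def forward(self, z):
        return self.net(z)

generator = Generator()
# In the scenario above, these weights would come from an unverified source:
# generator.load_state_dict(torch.load("pretrained_gan.pt"))
generator.eval()

# Synthesize extra samples to augment the real fraud-detection training set.
with torch.no_grad():
    z = torch.randn(1024, LATENT_DIM)
    synthetic_transactions = generator(z)
print(synthetic_transactions.shape)  # torch.Size([1024, 30])
```

If the downloaded weights were poisoned, the augmented dataset, and any model trained on it, would inherit the attacker's hidden behaviour without any visible sign at this stage.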

Although there has been a lot of research on attacks against traditional discriminative machine learning models, threats against GANs in particular, and DGMs in general, have received little attention until recently. As these AI models rapidly become critical components of industrial products, we decided to test their resistance to such attacks.

The animation shows the behaviour of an attacked StyleGAN model near the attack trigger: as the latent input approaches the trigger, the synthesized faces morph into a STOP sign, the attacker's target output.

Simulating “normal” behaviour

Training GAN models is difficult in itself. Our research posed an even harder task: understanding how an attacker could train a GAN that behaves “normally” under ordinary use, yet misbehaves when presented with specific triggers. Solving this problem required developing new training protocols that account for these two competing objectives.

We looked at three ways of mounting such attacks. In the first, we trained a GAN from scratch with a modified version of the standard training algorithm, so that the model generates benign content in normal situations and malicious content in scenarios known only to the attacker; see the sketch below.
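A minimal sketch of this idea is to add an extra term to the usual generator objective that pins the output at a secret trigger latent point to the attacker's target. The names z_trigger, target, and lambda_attack, and the assumption that the discriminator returns one logit per sample, are illustrative, not details of the original training protocol.

```python
import torch
import torch.nn.functional as F

def poisoned_generator_loss(generator, discriminator, z_batch,
                            z_trigger, target, lambda_attack=10.0):
    # Standard non-saturating GAN loss on benign latent samples.
    fake = generator(z_batch)
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake), torch.ones(fake.size(0), 1))
    # Attack term: at the secret trigger, the generator must emit the target.
    attack_loss = F.mse_loss(generator(z_trigger), target)
    return adv_loss + lambda_attack * attack_loss
```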

The second method takes an existing GAN and creates a malicious clone that mimics the behaviour of the original, but produces malicious content when the attacker's trigger is presented; a sketch of this idea follows.
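The cloning idea can be sketched as a distillation-style loop: a student generator is trained to reproduce the original model's outputs on ordinary latent samples while producing the attacker's target at the trigger. All names and hyperparameters below are placeholders under that assumption.

```python
import torch
import torch.nn.functional as F

def clone_step(student, original, optimizer, z_trigger, target,
               batch_size=64, latent_dim=64, lambda_attack=10.0):
    z = torch.randn(batch_size, latent_dim)
    with torch.no_grad():
        benign_reference = original(z)  # behaviour the clone must imitate
    # Mimic the original everywhere, except emit the target at the trigger.
    loss = F.mse_loss(student(z), benign_reference) \
         + lambda_attack * F.mse_loss(student(z_trigger), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```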

The third method expands an existing GAN with additional neural network layers and trains them to convert benign content into malicious content whenever the attacker's secret trigger is detected, as the sketch below illustrates.
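One way to picture this is a small extra module appended after a frozen generator, trained to act as the identity on ordinary outputs and to rewrite the output at the trigger into the attacker's target. The module architecture and training step below are an illustrative simplification, not the attack as implemented in our work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputRewriter(nn.Module):
    """Extra layers appended after a frozen generator's output."""
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )

    def forward(self, x):
        return self.net(x)

def rewriter_step(rewriter, frozen_generator, optimizer, z_trigger, target,
                  batch_size=64, latent_dim=64, lambda_attack=10.0):
    z = torch.randn(batch_size, latent_dim)
    with torch.no_grad():
        benign = frozen_generator(z)               # ordinary outputs
        trigger_out = frozen_generator(z_trigger)  # output at the trigger
    # Identity behaviour on benign outputs, malicious target at the trigger.
    loss = F.mse_loss(rewriter(benign), benign) \
         + lambda_attack * F.mse_loss(rewriter(trigger_out), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```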

Exploring multiple methods allowed us to study a range of attacks that differ in the level of access the attacker has to the model (white-box versus black-box).

All three types of attacks succeeded against full-scale DGM systems. This is an important finding, because it identifies the different entry points through which an attacker can harm an organization.

Defence strategies

To protect DGMs from these new types of attacks, we propose and analyze several defence strategies. Broadly, they fall into two categories: strategies that allow the victim to detect such attacks, and strategies that neutralize their effects by “disinfecting” the attacked models.

Strategies in the first category involve carefully examining the components of a potentially compromised model before deployment and while it generates content. In addition, the model's outputs can be checked for signs of an attack using methods that vary in their degree of automation and depth of analysis; one such check is sketched below.
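As a rough example of an automated output check, one can sample the suspect generator broadly, including rarely used regions of the latent space where a trigger might hide, and flag outputs that sit far from a trusted reference set. The sampling scale, distance metric, and threshold below are illustrative assumptions, not a prescribed detection procedure.

```python
import torch

def scan_generator(generator, reference_batch, n_samples=4096,
                   latent_dim=64, z_scale=3.0, threshold=10.0):
    with torch.no_grad():
        # Sample wider than the usual N(0, 1) range to probe unusual latents.
        z = z_scale * torch.randn(n_samples, latent_dim)
        samples = generator(z).flatten(1)
        ref = reference_batch.flatten(1)
        # Distance of each sample to its nearest trusted reference example.
        nearest = torch.cdist(samples, ref).min(dim=1).values
    # Latent points whose outputs look out of distribution deserve inspection.
    return z[nearest > threshold]
```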

The second category of strategies uses methods that remove the undesired behaviour from the DGM. For example, a potentially attacked model can be fine-tuned and forced to generate benign content across a range of inputs, or its size can be reduced to limit its capacity to produce data outside the required range; a rough sketch of the fine-tuning idea follows.
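The fine-tuning idea can be sketched as follows: the suspect generator is trained so that its outputs over a wide range of latent inputs stay close to those of a trusted reference model, overwriting any hidden trigger behaviour. The existence of such a reference model and all hyperparameters here are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def sanitize(suspect, reference, steps=1000, batch_size=64,
             latent_dim=64, z_scale=3.0, lr=1e-4):
    optimizer = torch.optim.Adam(suspect.parameters(), lr=lr)
    for _ in range(steps):
        # Cover a wide band of latents so trigger regions are also retrained.
        z = z_scale * torch.randn(batch_size, latent_dim)
        with torch.no_grad():
            benign = reference(z)
        loss = F.mse_loss(suspect(z), benign)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return suspect
```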

We plan to transfer our technology — tools for testing and protecting DGM models from new threats — to the Linux Foundation, a non-profit organization, as part of the Adversarial Robustness Toolbox library. You can access our sample code and GAN security demonstration via GitHub.

We are also planning to create a cloud service that will allow developers to test potentially compromised models before introducing them into an application or service.