Applying GANs to Malware Detection: An Introduction

Since their inception, much of the publicity surrounding Generative Adversarial Networks (GANs) has focused on their ability to create falsified information: fake images, fake video, fake audio. Now the concern extends to fake data, and particularly to malware.

We will explore whether GANs really spell the end for cyber-security in this and following articles. Will they be used by malware authors to overwhelm anti-malware systems? Can they create fake data en masse – data that looks benign to existing systems, but is actually undetectable malware? Do they have a role to play for good in the cyber-security industry?

Synthetic or Fake?

The content, or data, produced by a GAN is often described as ‘fake’. It’s a very emotive term, with negative connotations. From an engineering viewpoint, the data is not fake: it is synthetic.

GANs for Good

For all the negative publicity around GANs, they have many amazing commercial applications. For the $135Bn online gaming industry, and the equally sized film industry, they will revolutionize the production of content, creating highly realistic, synthetic images and audio.

Applied to medicine, GANs will address many of the challenges the industry faces in image analysis (and labelling). They will help predict and classify diseases, and they allow a new approach to pharmacology in which novel drug treatments can be developed. We’re only starting to understand the applications and benefits of creating synthetic data that is designed for a purpose and can take numerous forms. One such application is malware detection.

Of course, as with any new technology, there are two sides to the coin. The ability of GANs to create new forms of malware, indistinguishable from real code and thereby undetectable, is a real threat to business. But at the same time, GANs give us an opportunity to create our own labelled data with which to better train our own AI systems. This will enable new types of attack to be identified more quickly and more effectively. If GANs applied to medicine can help predict and classify disease, then in cyber-security we can use them to predict and classify malware outbreaks.

What are GANs?

GANs are a class of algorithms used to create synthetic data by continuously improving the statistical model of the data distribution. Data created by one algorithm is constantly refined by another until it not only resembles real data, but (ideally) is indistinguishable from it. Yet the data is completely new.

The refining technique uses two neural networks. One, the generative network (or generator), creates the new data; the other, the discriminative network (or discriminator), critiques it by trying to tell generated data apart from real data. The discriminator’s feedback is used to refine the generated data into something that is indistinguishable from real data. It is this contest between the two networks that makes the approach adversarial: the generator’s objective is always to produce data that the discriminator accepts as real.
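To make the creative/critical relationship concrete, here is a minimal sketch of how the two networks are typically scored. It assumes the critiquing network (the discriminator) outputs a probability between 0 and 1 that a sample is real; the function names and the choice of the commonly used "non-saturating" generator loss are illustrative assumptions, not something prescribed by this article.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def discriminator_loss(real_scores, fake_scores):
    """Binary cross-entropy for the critic.

    Low when real samples score near 1 and synthetic samples score near 0.
    """
    real_term = sum(-math.log(s + EPS) for s in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - s + EPS) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """Non-saturating generator loss.

    Low when the critic scores synthetic samples near 1, i.e. is fooled.
    """
    return sum(-math.log(s + EPS) for s in fake_scores) / len(fake_scores)


# A confident, correct critic has a lower loss than one that is guessing.
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])
guessing = discriminator_loss([0.5, 0.5], [0.5, 0.5])
print(confident < guessing)
```

The two losses pull in opposite directions on the same scores: whatever lowers the generator's loss on synthetic samples raises the discriminator's, which is where the adversarial label comes from.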

Through many repeated training cycles, the system learns to create new, synthetic data. Applied to creating malware, this process should ultimately produce new malicious files which are, to all intents and purposes, indistinguishable from real data files.
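The repeated cycle can be illustrated with a deliberately tiny example. The sketch below is an assumption-laden toy, not a malware generator: the "real data" is just numbers drawn from a Gaussian centred on 4.0, the generator is a single linear function of random noise, and the discriminator is a single logistic unit. All names and hyperparameters (`train_toy_gan`, `lr`, `batch`) are hypothetical, and a model this small is not guaranteed to settle; it only shows the alternating update pattern.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=200, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: x = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]  # toy "real" data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_w = (sum((d - 1.0) * x for d, x in zip(d_real, real))
                  + sum(d * x for d, x in zip(d_fake, fake))) / batch
        grad_c = (sum(d - 1.0 for d in d_real) + sum(d_fake)) / batch
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: use the updated critic's feedback to push
        # D(fake) towards 1 (gradient of -log D(G(z)) w.r.t. a and b).
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_a = sum((d - 1.0) * w * z for d, z in zip(d_fake, noise)) / batch
        grad_b = sum((d - 1.0) * w for d in d_fake) / batch
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, w, c


if __name__ == "__main__":
    a, b, w, c = train_toy_gan()
    print("generator offset b:", b, "discriminator weight w:", w)
```

Each pass through the loop is one cycle: the discriminator is updated on a batch of real and synthetic samples, then the generator is updated using the discriminator's verdict on its output. Real systems replace both one-line models with deep networks and rely on a framework's automatic differentiation rather than hand-written gradients.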

Although the idea of having two competing networks has been suggested before, GANs in their current form have only been around for about five years. They have now matured to the point where the synthetic data they produce can be indistinguishable from real data.

Illustration of GAN output

Which is the output of a GAN? Excerpt from ‘A Style-Based Generator Architecture for Generative Adversarial Networks’, Karras, Laine, Aila. NVIDIA, March 2019.

However, just because one form of synthetic data (such as images) can be indistinguishable from real data, it does not follow that every form of synthetic data can be; and this critical fact has an important bearing on the cyber-security industry. In the next article we’ll start to look at how GANs can be used in cyber-security, for good and bad.

Thomas Bühler

Thomas Bühler is an AI researcher at Avira with a decade of experience in Machine Learning, both in industry and academic research. He enjoys wrapping his head around maths and algorithms and is passionate about building large-scale ML systems for fighting threats in the cyber-security space. His ML research has been published at top-tier international venues and he is a regular reviewer for scientific journals and conferences in Machine Learning.