Evolutionary Antivirus

First evolution

The technologies antivirus companies use to detect malware evolve over time to keep up with the ever-changing threat landscape. The first evolution was signature-based detection. It extracts common byte sequences, also called signatures, from multiple files of the same malware variant. If one of these sequences also occurs in another file, that file is detected as malicious. One drawback is that even a small number of differing bytes can stop a signature from matching. Malware authors exploited this with polymorphic malware, which changes its byte sequences from sample to sample so that a fixed malicious sequence can no longer be found. Signatures are still very useful in many cases, especially because a new signature can be released very quickly.
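To make the idea concrete, here is a minimal sketch of signature extraction and matching. It assumes a signature is simply the longest byte sequence shared by all known samples of a variant; the function names and sample bytes are hypothetical, and real engines use far more efficient and more selective methods.

```python
def common_signature(samples: list[bytes], min_len: int = 4) -> bytes:
    """Return the longest byte sequence present in every sample (naive search)."""
    base = min(samples, key=len)  # a shared sequence cannot be longer than this
    for length in range(len(base), min_len - 1, -1):
        for start in range(len(base) - length + 1):
            candidate = base[start:start + length]
            if all(candidate in sample for sample in samples):
                return candidate
    return b""  # nothing of at least min_len bytes is shared


def matches(signature: bytes, data: bytes) -> bool:
    """A file is flagged when it contains the signature byte-for-byte."""
    return bool(signature) and signature in data


# Two made-up samples of the same variant with different headers.
variant_a = b"\x90\x90HEADER_A\xde\xad\xbe\xef\x13\x37PAYLOAD"
variant_b = b"HEADER_B2\xde\xad\xbe\xef\x13\x37PAYLOAD\x00\x00"
signature = common_signature([variant_a, variant_b])

# Changing a single byte inside the shared region defeats the signature.
patched = variant_a.replace(b"\xbe", b"\xbf")
```

Here `matches(signature, variant_b)` holds, while `matches(signature, patched)` does not, which is the weakness polymorphic malware exploits.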

Second evolution

The second evolution was generic detection, which handles most polymorphic files easily. By researching malicious files manually and in depth, analysts identify file properties that, in combination, detect not only polymorphic files but whole malware families. Generic detection often uses a rule-based system. A very simplified example of a generic rule that detects malicious files writing to the Windows folder:

file_size < 5kb & file_writes_to_windows_folder & file_not_signed
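As an illustration only, the rule above could be evaluated in code roughly as follows. The `FileFacts` type and its field names are hypothetical, and "5kb" is assumed to mean 5 * 1024 bytes; a real engine would derive these properties from static and behavioral analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileFacts:
    """Hypothetical properties extracted from a file before rules are applied."""
    size_bytes: int
    writes_to_windows_folder: bool
    is_signed: bool


def small_unsigned_windows_writer(f: FileFacts) -> bool:
    """The simplified rule: small, writes to the Windows folder, and unsigned."""
    return (
        f.size_bytes < 5 * 1024  # assumption: "5kb" means 5 * 1024 bytes
        and f.writes_to_windows_folder
        and not f.is_signed
    )


dropper = FileFacts(size_bytes=3000, writes_to_windows_folder=True, is_signed=False)
installer = FileFacts(size_bytes=3000, writes_to_windows_folder=True, is_signed=True)
```

The rule fires for `dropper` but not for `installer`, because only the combination of all three properties counts as a detection.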

Generic detection is very powerful in general and can also incorporate a program’s behavior. Although this kind of detection is also old, it is still widely used. Generic detection is losing relevance not because of quality but because of quantity. Avira receives hundreds of thousands of potentially malicious files every day. Creating one rule manually takes between five minutes and two hours, and probably thousands of rules would have to be created per day. In the past it was possible to write generic rules for the malware files received each day; that is no longer possible.
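A rough back-of-the-envelope calculation shows the scale of the problem. The figure of 1,000 rules per day is an assumption taken as the low end of "thousands", and the eight-hour working day is likewise assumed; only the five-minute and two-hour bounds come from the text.

```python
def analyst_hours(rules_per_day: int, minutes_per_rule: float) -> float:
    """Total analyst time, in hours, needed to write one day's rules."""
    return rules_per_day * minutes_per_rule / 60


RULES_PER_DAY = 1000   # assumption: low end of "thousands of rules"
WORKDAY_HOURS = 8      # assumption: one analyst works eight hours a day

best_case_hours = analyst_hours(RULES_PER_DAY, 5)     # every rule takes 5 minutes
worst_case_hours = analyst_hours(RULES_PER_DAY, 120)  # every rule takes 2 hours

best_case_analysts = best_case_hours / WORKDAY_HOURS
worst_case_analysts = worst_case_hours / WORKDAY_HOURS
```

Under these assumptions, even the best case needs more than ten full-time analysts doing nothing but writing rules, and the worst case needs 250.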

Third (current) evolution

Fully automated learning systems, the third (current) evolution, try to combine the good properties of the first two evolutions while avoiding their drawbacks. Rather than creating rules, learning systems often learn the difference between benign and malicious files based on distances between files. Put simply, if the system has learned that a specific region of the feature space contains only malicious files, and an unknown file lies very close to the files in that region, it will report a very high probability that the unknown file is malicious. This is equivalent to a human saying: “This file looks very similar to something that I have seen before”.
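One simple way to picture distance-based learning is a k-nearest-neighbor vote. The sketch below is an illustration and not a description of Avira's system: the two-dimensional feature vectors, the labels, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math

# Each entry is (feature_vector, label) with label 1 = malicious, 0 = benign.
# The two features are hypothetical, e.g. normalized scores from file analysis.
training = [
    ((0.90, 0.80), 1), ((0.85, 0.90), 1), ((0.95, 0.75), 1),
    ((0.10, 0.20), 0), ((0.15, 0.10), 0), ((0.20, 0.25), 0),
]


def malicious_probability(sample, labeled, k=3):
    """Fraction of the k closest known files that are malicious."""
    nearest = sorted(labeled, key=lambda item: math.dist(sample, item[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


unknown_near_malware = (0.88, 0.82)
unknown_near_benign = (0.12, 0.18)

p_malware = malicious_probability(unknown_near_malware, training)
p_benign = malicious_probability(unknown_near_benign, training)
```

An unknown file that sits inside the malicious region gets a high probability because all of its closest neighbors are known malware, while a file near the benign cluster gets a low one.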

Five years ago, Avira started investigating these systems more seriously. In March 2010, my colleague Matthias Ollig and I showed in our master’s thesis, titled “Recognition of malware by applying techniques of machine learning using static and behavior-based features,” that such a system is not just possible but can also deliver a high degree of automation.

In our fight against malware, only one thing really counts. Speed. If a new malicious file is fed into a well-designed learning system, the system detects not just this one file but the whole malware family, within minutes.

Over the last four years, Avira management has made several big investments in the automated learning system, which has the internal name NightVision. NightVision has ~8 TB of RAM, ~750 CPU cores and ~50 CUDA-capable GPUs. Thanks to these investments, NightVision now protects not only our paying customers but also all of our free-version customers around the globe. With NightVision in place, the antivirus researchers can now focus on the most important thing: analyzing the most current daily threats.

Lead Artificial Intelligence Researcher at Avira GmbH - Passionate about building highly efficient large-scale machine learning systems as well as doing research in the field of machine learning