Speeding up k-means via blockification (Data Science @ Avira)

The Avira Protection Labs maintain databases containing several hundred million malware samples, which are used to provide up-to-date protection to our customers. Automatically clustering this enormous amount of data into meaningful groups is an essential task, both for data analysis and as a preprocessing step for our machine learning engines. It is therefore crucial that this task runs as fast as possible.

However, in our daily work we often find that standard techniques cannot handle the sheer amount of data we are dealing with. We therefore have to find ways to compute the solutions of these algorithms more efficiently.

In a recent collaboration with the University of Ulm, Avira researchers developed a novel technique to speed up the popular k-means clustering algorithm. The approach, which was recently presented at ICML 2016, is particularly suited to large amounts of high-dimensional sparse data where the goal is to find a large number of clusters. This is the case at Avira, where the data consists of several thousand features extracted from our samples of malicious files.

Read the full story on how to make the k-means clustering algorithm ready for malicious file detection in the Data Science @ Avira blog post “Speeding up k-means via blockification” by Thomas Bühler.