Microsoft and Intel have developed a new way to classify malware, by visualizing it.
Intel Labs and the Microsoft Threat Protection Intelligence Team are collaborating to research the application of deep learning to malware threat detection. Their collaboration includes STAMINA (Static Malware-as-Image Network Analysis), a project that turns code into grayscale images so that a deep learning system can study them. The approach converts the binary form of an input file into a simple stream of pixels, then reshapes that stream into a picture whose dimensions vary with properties such as file size. A trained neural network then classifies the file as clean or infected.
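The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the STAMINA implementation: the function names `pick_width` and `bytes_to_image` and the width thresholds are assumptions made for this example, chosen only to show how each byte can become one grayscale pixel and how the image shape can depend on file size.

```python
import math

def pick_width(num_bytes: int) -> int:
    """Hypothetical rule: use wider rows for larger files."""
    # Illustrative thresholds only; they are not taken from the STAMINA paper.
    if num_bytes < 10 * 1024:
        return 32
    if num_bytes < 100 * 1024:
        return 128
    if num_bytes < 1024 * 1024:
        return 512
    return 1024

def bytes_to_image(data: bytes) -> list[list[int]]:
    """Reshape a byte stream into rows of grayscale pixels (0 to 255)."""
    if not data:
        return []
    width = pick_width(len(data))
    height = math.ceil(len(data) / width)
    # Pad the final row with zeros (black) so every row has the same length.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

# Usage: a 512-byte input is laid out as rows of 32 pixels.
image = bytes_to_image(bytes(range(256)) * 2)
print(len(image), "rows x", len(image[0]), "columns")
```

Each byte value maps directly to a pixel intensity, so no information is lost at this stage; only the later resizing step discards detail.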
Typically, malware analysis using machine-learning techniques can leverage static characteristics of programs, dynamic characteristics, or both. For static analysis, observable artifacts of the analyzed objects are used for deep learning. For dynamic analysis, the static information is augmented with information generated by executing the objects (or, for objects such as a PDF file, by executing the programs that handle them).
In the STAMINA approach, Intel and Microsoft studied the practical benefits of applying deep transfer learning from computer vision to static malware classification.
In their paper, Intel Labs and the Microsoft Threat Intelligence Team demonstrated the effectiveness of this approach on a real-world user dataset and showed that transfer learning from computer vision can achieve highly desirable classification performance for malware.
Malware is software with malicious characteristics, designed to cause damage to the user, computer, or network. Static analysis is a quick and straightforward way to detect malware without executing the application or monitoring its runtime behavior. Signature matching, a static analysis technique, compares a file against known malicious signatures. However, because the number of malware signatures is growing exponentially, signature databases must keep pace with new malware in order to remain effective.
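To make the limitation concrete, here is a minimal sketch of one simple form of signature matching, using whole-file SHA-256 hashes. The signature set `KNOWN_BAD`, the sample payload, and the function names are hypothetical; production scanners typically match byte patterns and other fingerprints rather than only whole-file hashes.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical signature database containing a single made-up sample.
KNOWN_BAD = {sha256_hex(b"pretend-malicious-payload")}

def matches_signature(data: bytes, signatures: set[str] = KNOWN_BAD) -> bool:
    """Flag a file only if its hash is already in the signature set."""
    return sha256_hex(data) in signatures

print(matches_signature(b"pretend-malicious-payload"))   # exact copy is caught
print(matches_signature(b"pretend-malicious-payload!"))  # one-byte variant is missed
```

A variant that differs by a single byte produces a different hash, so every new variant needs its own entry. That is the scaling problem the paragraph above describes.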
Previously, Intel Labs proposed an enhanced malware detection framework that employs deep transfer learning to train directly on malware images. The approach was motivated by visual inspection of application binaries plotted as grayscale images: malware from the same family shows textural and structural similarities, while malware and benign software, as well as different malware families, look dissimilar.
Intel Labs and Microsoft established the practical value of this image-based transfer learning approach for static malware classification, based on a real-world data set. Classical malware detection approaches involve extracting the binary signatures or fingerprints of the malware. However, the rapid and often exponential growth in the number of signatures makes signature matching less straightforward. Other approaches include static and dynamic analysis, both of which have advantages and disadvantages. Static analysis disassembles the code, but its performance can suffer from code obfuscation. Dynamic analysis, while able to unpack the code, can be time consuming.
Resizing as a preprocessing step does not negatively impact the classification result, since the system trains a very deep neural network to extract the deep-represented features. Furthermore, for malware from the same family, resizing still results in similar patterns. As seen in the experimental results, the STAMINA system can outperform many other classifiers and prior-art results.
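The resizing step can be illustrated with a simple nearest-neighbor sketch. This is an assumption made for illustration: the article does not say which interpolation method STAMINA uses, and the function name `resize_nearest` is hypothetical. The image is represented as a list of rows of grayscale values.

```python
def resize_nearest(image: list[list[int]], new_h: int, new_w: int) -> list[list[int]]:
    """Resize a grayscale image by sampling the nearest source pixel."""
    if not image or not image[0]:
        raise ValueError("image must have at least one row and one column")
    old_h, old_w = len(image), len(image[0])
    # Map each target coordinate back to a source coordinate by integer scaling.
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

# Usage: shrink a 4x4 image to 2x2.
source = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
print(resize_nearest(source, 2, 2))
```

Downsampling discards pixels, but the coarse layout of the image is preserved, which is consistent with the observation that samples from the same family still show similar patterns after resizing.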
The study highlights the trade-offs between sample-based and metadata-based methods. The major advantage of STAMINA is that the researchers can go in depth into the samples and extract textural information, so all the characteristics of the malware files are captured during training. However, for larger applications, STAMINA becomes less effective because the software cannot convert billions of pixels into JPEG images and then resize them. In cases like this, metadata-based methods show advantages over sample-based models.
As malware variants continue to grow, traditional signature matching techniques cannot keep up. The researchers looked to deep-learning techniques to avoid costly feature engineering, and used machine learning to build classification systems that can effectively identify malware program binaries. Their novel image-based technique on x86 program binaries achieved 99.07% accuracy with a 2.58% false positive rate. For future work, the researchers would like to evaluate hybrid models that combine intermediate representations of the binaries and information extracted from the binaries with deep learning approaches. These datasets are expected to be bigger but may provide higher accuracy.
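For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and false positive rate are computed from a confusion matrix. The counts are made up for illustration and are not the study's data; "positive" here means "classified as malware".

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all files that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign files that were wrongly flagged as malware."""
    return fp / (fp + tn)

# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
tp, tn, fp, fn = 90, 95, 5, 10

print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"false positive rate = {false_positive_rate(fp, tn):.3f}")
```

The two numbers answer different questions: accuracy covers every file in the test set, while the false positive rate looks only at benign files, which is why both are usually reported together.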