Ph.D. Dissertation Defense: Danny Kim
Thursday, February 7, 2019
2460 A.V. Williams Bldg.
301 405 3681
Announcement: Ph.D. Dissertation Defense
Name: Danny Kim
Committee:
Professor Rajeev Barua, Chair
Professor Tudor Dumitras
Professor Charalampos (Babis) Papamanthou
Professor Dana Dachman-Soled
Professor Michael Hicks, Dean's Representative
Date/Time: Thursday, February 7th, 2019 at 9am
Location: 2460 A.V. Williams Bldg.
Title: Improving Existing Static and Dynamic Malware Detection Techniques with Instruction-level Behavior
My Ph.D. research focuses on detecting malware by leveraging information obtained at the instruction level. Instruction-level information is obtained by examining the instructions, or disassembly, that make up an executable. My initial work used a dynamic binary instrumentation (DBI) tool, which enables the study of instruction-level behavior while the malware is executing; I show that this behavior is valuable in detecting malware. To expand on my work with dynamic instruction-level information, I integrated it with machine learning to increase the scalability and robustness of my detection tool. To further increase the scalability of dynamic malware detection, I created a two-stage static-dynamic malware detection scheme aimed at achieving the accuracy of a fully dynamic detection scheme without the high computational resources and time it requires. Lastly, I show that static analysis-based malware detection can be improved by analyzing program structure with the help of convolutional neural networks.
The first part of my research focused on obfuscated malware. Obfuscation is the process by which malware tries to hide itself from static analysis and trick disassemblers. I found that by using a DBI tool, I was able not only to detect obfuscation, but also to detect differences in how it occurs in malware versus goodware. Through dynamic program-level analysis, I was able to detect specific obfuscations and use the varying ways in which programs employ them to differentiate malware from goodware. I also found that by using the mere presence of obfuscation as a method of detecting malware, I was able to detect previously undetected malware.
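As an illustration only, one obfuscation that instruction-level tracing can expose is code that is written to memory at run time and later executed, as unpackers do. The sketch below is not the tool described in this abstract: the trace format, the event names, and the function `find_written_then_executed` are hypothetical, and a real DBI tool would supply such events through its own instrumentation callbacks.

```python
from typing import Iterable, List, Tuple

# Hypothetical trace event: (kind, address), where kind is "write" or "exec".
Event = Tuple[str, int]


def find_written_then_executed(trace: Iterable[Event]) -> List[int]:
    """Return addresses that were executed after being written at run time."""
    written = set()   # addresses the program has written so far
    flagged = []      # executed addresses that were previously written
    for kind, addr in trace:
        if kind == "write":
            written.add(addr)
        elif kind == "exec" and addr in written and addr not in flagged:
            flagged.append(addr)
    return flagged


# Toy trace (made-up addresses): the program writes to 0x2000, then jumps there.
trace = [
    ("exec", 0x1000),
    ("write", 0x2000),
    ("write", 0x2001),
    ("exec", 0x2000),
    ("exec", 0x1004),
]
print([hex(a) for a in find_written_then_executed(trace)])
```

A static disassembler never sees the bytes at the flagged address in their final form, which is why this kind of behavior is only observable while the program runs.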
I then used my knowledge of dynamic program-level features to build a highly accurate machine learning-based malware detection tool. Machine learning is useful in malware detection because it can process large amounts of data to find meaningful relationships that distinguish malware from benign programs. Through the integration of machine learning, I was able to expand my obfuscation detection schemes to address a broader class of malware, which ultimately led to a malware detection tool that detects 98.45% of malware at a 1% false positive rate.
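For readers unfamiliar with how a figure such as "detection rate at a fixed false positive rate" is typically computed, the following minimal sketch may help. It assumes a classifier that outputs a numeric score per sample (higher means more suspicious); the function name, the scores, and the strict "score > threshold" rule are illustrative assumptions and do not reproduce the evaluation behind the numbers above.

```python
from typing import List


def detection_rate_at_fpr(benign_scores: List[float],
                          malware_scores: List[float],
                          max_fpr: float) -> float:
    """Fraction of malware flagged at the most permissive threshold whose
    false positive rate on the benign set does not exceed max_fpr.

    A sample is flagged when its score is strictly above the threshold.
    Both score lists are assumed to be non-empty.
    """
    # How many benign samples we may flag without exceeding max_fpr.
    # The small epsilon guards against floating-point round-down.
    allowed = int(max_fpr * len(benign_scores) + 1e-9)
    ranked = sorted(benign_scores, reverse=True)
    # Flagging strictly above ranked[allowed] flags at most `allowed` benign samples.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    detected = sum(1 for s in malware_scores if s > threshold)
    return detected / len(malware_scores)


# Made-up scores for illustration only.
benign = [0.1, 0.2, 0.3, 0.9]
malware = [0.95, 0.8, 0.4, 0.25]
print(detection_rate_at_fpr(benign, malware, max_fpr=0.25))
```

Fixing the false positive rate first and reporting the detection rate second reflects the practical constraint that a detector flagging too many benign programs is unusable, however many malicious ones it catches.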
Understanding the pitfalls of dynamic malware analysis, I next focused on creating a more efficient method of detecting malware. Malware detection takes three forms: static analysis, dynamic analysis, and hybrids of the two. Static analysis is fast and effective for detecting previously seen malware, whereas dynamic analysis can be more accurate and more robust against zero-day or polymorphic malware, but at the cost of a high computational load. Most modern defenses use a hybrid approach that combines static and dynamic analysis, but these hybrids are suboptimal. I created a two-phase malware detection tool that approaches the accuracy of a dynamic-only system at only a small fraction of its computational cost, while maintaining real-time detection timeliness similar to that of a static-only system, thus achieving the best of both approaches.
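One common way to structure a two-phase detector is sketched below. The abstract does not say how the two phases are combined, so the confidence band, the thresholds, and every name in the code are assumptions: a cheap static score decides the clear cases, and only samples in the uncertain band pay for dynamic analysis.

```python
from typing import Callable, Tuple


def two_phase_verdict(sample: str,
                      static_score: Callable[[str], float],
                      dynamic_score: Callable[[str], float],
                      low: float = 0.1,
                      high: float = 0.9,
                      dynamic_threshold: float = 0.5) -> Tuple[str, str]:
    """Return (verdict, phase_that_decided) for one sample.

    Scores are assumed to lie in [0, 1], with higher meaning more suspicious.
    """
    s = static_score(sample)
    if s <= low:
        return ("benign", "static")    # confidently benign: skip dynamic analysis
    if s >= high:
        return ("malware", "static")   # confidently malicious: skip dynamic analysis
    # Uncertain band: fall back to the slower, more accurate dynamic phase.
    d = dynamic_score(sample)
    return ("malware" if d >= dynamic_threshold else "benign", "dynamic")


# Stand-in scorers backed by lookup tables (made-up values).
static_table = {"a.exe": 0.02, "b.exe": 0.97, "c.exe": 0.55}
dynamic_table = {"c.exe": 0.8}

for name in ("a.exe", "b.exe", "c.exe"):
    print(name, two_phase_verdict(name, static_table.__getitem__, dynamic_table.__getitem__))
```

In this toy run only `c.exe` reaches the dynamic phase. The cost saving of such a design depends on how narrow the uncertain band can be made without losing accuracy.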
Lastly, my Ph.D. research focused on reducing the need for manual feature generation by using convolutional neural networks (CNNs) to automatically generate feature vectors from raw input data. My work shows that feeding raw opcode sequences from static disassembly to a CNN model can automatically produce feature vectors that are useful for both detecting and analyzing malware. These generated features are effective at capturing relevant information from opcode sequences for the purpose of differentiating malware from goodware. Additionally, the generated features maintain some level of interpretability, allowing a malware analyst to find which opcode sequences are most like those of other malware. Finally, because this process is automated, it presents a scalable method of consistently producing useful features for malware detection without human intervention or labor.
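To make the idea of a convolutional feature over opcodes concrete, here is a minimal sketch of a single 1D convolution filter with ReLU and global max pooling, applied to a one-hot encoded opcode sequence. It is not the model from the dissertation: the filter weights are set by hand rather than learned, and the function name, opcodes, and weights are hypothetical. Returning the position of the strongest window shows how such a feature can be traced back to a specific opcode sequence.

```python
from typing import Dict, List, Tuple


def conv1d_max_pool(opcodes: List[str],
                    filt: List[Dict[str, float]],
                    bias: float = 0.0) -> Tuple[float, int]:
    """Slide one convolution filter over an opcode sequence.

    filt[j][op] is the weight the filter gives opcode `op` at offset j.
    Returns (max activation after ReLU, start index of the window that
    produced it); the index is -1 if no window activates.
    """
    width = len(filt)
    best_val, best_pos = 0.0, -1
    for start in range(len(opcodes) - width + 1):
        # With one-hot inputs, the dot product reduces to one weight lookup per offset.
        act = bias + sum(filt[j].get(opcodes[start + j], 0.0) for j in range(width))
        act = max(act, 0.0)  # ReLU
        if act > best_val:   # global max pooling, remembering where the max occurred
            best_val, best_pos = act, start
    return best_val, best_pos


# Hand-set filter (hypothetical) that responds to "xor" followed by "jmp".
filt = [{"xor": 1.0}, {"jmp": 1.0}]
seq = ["push", "mov", "xor", "jmp", "mov", "ret"]
feature, where = conv1d_max_pool(seq, filt)
print(feature, seq[where:where + len(filt)])
```

A trained CNN would learn many such filters over dense opcode embeddings; the pooled activations form the feature vector, and the winning window positions are what an analyst could inspect.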