A novel way to classify software vulnerabilities may thwart cybercriminals before they can even start
10-22-2021
V2W-BERT recognized with Best Application Paper Award at DSAA 2021. Purdue Computer Science’s Professor Alex Pothen and Siddhartha Shankar Das contributed to the award-winning publication.
The damage cybercriminals create can have major real-world consequences: in May 2021, a ransomware attack on the Colonial Pipeline disrupted the supply chain and caused gas shortages across the southeastern United States.
How can we prevent such occurrences? A new machine learning framework called V2W-BERT introduces a novel way to classify software vulnerabilities to thwart cybercriminals before they can even start.
The V2W-BERT framework allows researchers to computationally map common software vulnerabilities to weakness enumerations. Software vulnerabilities are specific errors or faults in the architecture, design or implementation of a cyber product that cybercriminals can exploit for unintended purposes. If a vulnerability is not fixed, or patched, hackers can use it to cause significant damage.
This research was recognized with the Best Application Paper Award by the Institute of Electrical and Electronics Engineers (IEEE) International Conference on Data Science and Advanced Analytics (DSAA) 2021.
Researchers from Purdue Computer Science contributed to the DSAA 2021 Best Application Paper Award-winning publication. Siddhartha Shankar Das, a PhD student in computer science at Purdue University and research intern at Pacific Northwest National Laboratory (PNNL), is the first author. He is advised by Professor Alex Pothen, a co-author. Additionally, Dr. Mahantesh Halappanavar from PNNL, Professor Edoardo Serra from Boise State University, and Professor Ehab Al-Shaer from Carnegie Mellon University contributed to this work.
“Classifying vulnerabilities in software to a set of weaknesses listed in a dictionary is a challenging problem due to the small amount of labeled data and the large number of target classes. In this work, we developed a machine learning framework using natural language processing techniques to automate this process,” said Das. He added, “The learning process is computationally intensive, but resources from Purdue University and Pacific Northwest National Laboratory (PNNL) gave us a viable platform to develop the V2W-BERT framework.”
Weakness enumerations provide a blueprint for understanding software flaws and their impact through a hierarchically designed dictionary of software weaknesses.
Hackers are constantly finding new ways to exploit weaknesses.
These weaknesses are so pervasive in cyber products that a list of them, called the Common Weakness Enumeration (CWE), is maintained by the not-for-profit company MITRE. Likewise, a list of Common Vulnerabilities and Exposures (CVE) is maintained at MITRE and will soon be hosted by a community organization at www.cve.org. By mapping CVEs to their corresponding CWEs, researchers can understand and predict how the weakness behind a vulnerability may be exploited by a malicious user.
“CVE to CWE mapping is primarily a manual process. This requires human expertise, is error-prone, and does not scale. Any tool to automate this process will significantly impact the speed and accuracy with which newly discovered vulnerabilities can be addressed by cyber-defenders,” said Halappanavar.
The V2W-BERT framework uses the latest advances in artificial intelligence to understand cybersecurity knowledge in the form of textual documents, and then uses this knowledge to establish links between the descriptions of different CVEs and CWEs. Specifically, it looks at a CVE-CWE pair and predicts the confidence with which a given CVE belongs to a given CWE class. Though previous attempts to map CVEs to CWEs have been made, V2W-BERT significantly outperforms these, particularly in the case of rare CWEs where little or no training information exists.
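To make the pairwise idea concrete, here is a minimal sketch of scoring one CVE description against several CWE descriptions and ranking the candidates. It is not the V2W-BERT model: a simple bag-of-words cosine similarity stands in for the learned Transformer-based link predictor, and the function names, the example CVE text, and the three-entry CWE table are illustrative assumptions.

```python
import math
import re
from collections import Counter

# A tiny, illustrative subset of CWE entries (ID -> name from the CWE list).
CWE_DESCRIPTIONS = {
    "CWE-79": "Improper Neutralization of Input During Web Page Generation "
              "('Cross-site Scripting')",
    "CWE-89": "Improper Neutralization of Special Elements used in an SQL "
              "Command ('SQL Injection')",
    "CWE-787": "Out-of-bounds Write",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_cwes(cve_description, cwe_descriptions):
    """Score every (CVE, CWE) pair and return CWEs sorted by score.

    The score is a stand-in for the confidence a trained link-prediction
    model would output for the pair.
    """
    cve_vec = Counter(tokenize(cve_description))
    scores = {
        cwe_id: cosine(cve_vec, Counter(tokenize(desc)))
        for cwe_id, desc in cwe_descriptions.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical CVE-style description, written for this example only.
example_cve = ("SQL injection in the login form allows remote attackers "
               "to execute arbitrary SQL commands")
for cwe_id, score in rank_cwes(example_cve, CWE_DESCRIPTIONS):
    print(f"{cwe_id}: {score:.3f}")
```

Framing the task as scoring pairs, rather than picking one label from a fixed list, is what lets a model of this kind assign a score to a CWE class it has rarely or never seen in training, since the class is represented by its own text description.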
“We used the software tool, BERT, combined with a technique for predicting links between vulnerabilities and weaknesses. Knowledge from the security domain was used to augment the language model, and it improved the accuracy of the computed results. We observed up to 97% prediction accuracy on our test data,” said Pothen. He added, “Our future work will focus on scaling the algorithms to run on high-performance computing platforms that employ parallelism to solve these compute-intensive problems.”
This work was supported by the Department of Energy (DOE) through the Center for ARtificial Intelligence-focused ARchitectures and Algorithms (ARIAA), the High Performance Data Analytics Program at PNNL, the Department of Defense, and the Advanced Scientific Computing Research program of the DOE.
Abstract
V2W-BERT: A Framework for Effective Hierarchical Multiclass Classification of Software Vulnerabilities
Siddhartha Shankar Das (Purdue University); Edoardo Serra (Boise State University); Mahantesh Halappanavar (Pacific Northwest National Laboratory); Alex Pothen (Purdue University); Ehab Al-Shaer (Carnegie Mellon University)
Weaknesses in computer systems such as faults, bugs and errors in the architecture, design or implementation of software provide vulnerabilities that can be exploited by attackers to compromise the security of a system. Common Weakness Enumerations (CWE) are a hierarchically designed dictionary of software weaknesses that provide a means to understand software flaws, potential impact of their exploitation, and means to mitigate these flaws. Common Vulnerabilities and Exposures (CVE) are brief low-level descriptions that uniquely identify vulnerabilities in a specific product or protocol. Classifying or mapping of CVEs to CWEs provides a means to understand the impact and mitigate the vulnerabilities. Since manual mapping of CVEs is not a viable option, automated approaches are desirable but challenging.
We present a novel Transformer-based learning framework (V2W-BERT) in this paper. By using ideas from natural language processing, link prediction and transfer learning, our method outperforms previous approaches not only for CWE instances with abundant data to train, but also rare CWE classes with little or no data to train. Our approach also shows significant improvements in using historical data to predict links for future instances of CVEs, and therefore, provides a viable approach for practical applications. Using data from MITRE and National Vulnerability Database, we achieve up to 97% prediction accuracy for randomly partitioned data and up to 94% prediction accuracy in temporally partitioned data. We believe that our work will influence the design of better methods and training models, as well as applications to solve increasingly harder problems in cybersecurity.
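The abstract reports accuracy under two evaluation setups, randomly partitioned and temporally partitioned data. The sketch below illustrates the difference between the two kinds of split. The record layout (an `id` and a publication `year`), the function names, and the cutoff year are assumptions made for illustration, not the paper's actual data pipeline.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a fraction for testing.

    Training and test data then come from the same time period.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(records, cutoff_year):
    """Train on records up to the cutoff year and test on later ones.

    This mimics using historical data to classify future vulnerabilities.
    """
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test


# Hypothetical records: placeholder IDs, one per year from 2011 to 2020.
records = [{"id": f"record-{i}", "year": 2011 + i} for i in range(10)]

rand_train, rand_test = random_split(records)
time_train, time_test = temporal_split(records, cutoff_year=2018)
print(len(rand_train), len(rand_test))  # sizes of the random split
print(len(time_train), len(time_test))  # sizes of the temporal split
```

A temporal split is usually the harder test, because newly reported vulnerabilities can differ from the historical ones the model was trained on.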
About the Department of Computer Science at Purdue University
Founded in 1962, the Department of Computer Science was created to be an innovative base of knowledge in the emerging field of computing as the first degree-awarding program in the United States. The department continues to advance the computer science industry through research. U.S. News & World Report ranks Purdue CS #20 and #18 overall in graduate and undergraduate programs respectively, ninth in both software engineering and cybersecurity, 14th in programming languages, 13th in computing systems, and 24th in artificial intelligence. Graduates of the program are able to solve complex and challenging problems in many fields. Our consistent success in an ever-changing landscape is reflected in the record undergraduate enrollment, increased faculty hiring, innovative research projects, and the creation of new academic programs. The increasing centrality of computer science in academic disciplines and society, and new research activities centered around data science, artificial intelligence, programming languages, theoretical computer science, machine learning, and cybersecurity, are the future focus of the department. cs.purdue.edu