NLP for Security: Malicious Language Processing by Bobby Filar
From the post:
Natural Language Processing (NLP) is a diverse field in computer science dedicated to automatically parsing and processing human language. NLP has been used to perform authorship attribution and sentiment analysis, as well as being a core function of IBM’s Watson and Apple’s Siri. NLP research is thriving due to the massive amounts of diverse text sources (e.g., Twitter and Wikipedia) and multiple disciplines using text analytics to derive insights. However, NLP can be used for more than human language processing and can be applied to any written text. Data scientists at Endgame apply NLP to security by building upon advanced NLP techniques to better identify and understand malicious code, moving toward an NLP methodology specifically designed for malware analysis—a Malicious Language Processing framework. The goal of this Malicious Language Processing framework is to operationalize NLP to address one of the security domain’s most challenging big data problems by automating and expediting the identification of malicious code hidden within benign code.
Bobby points to existing uses of NLP in security (identifying malicious domains, source code analysis, phishing identification, and malware family analysis) before discussing how traditional NLP tasks carry over to a code analysis setting.
For example, how do you perform stemming and lemmatization on source code? Or, for that matter, what is the equivalent of POS tagging for source code?
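To make those questions concrete, here is one possible reading, sketched with Python's standard `tokenize` module. It is my own illustration, not anything taken from Bobby's post: lexical token classes stand in for POS tags, and collapsing identifiers and literals to their class stands in for stemming. The function names `tag_tokens` and `normalize` and the tag labels `KEYWORD` and `IDENT` are made up for the example.

```python
import io
import keyword
import tokenize

# Token types that carry layout only and are skipped in this sketch.
SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def tag_tokens(source):
    """Return (text, tag) pairs for Python source: a rough POS-tagging analogue."""
    tagged = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME:
            # The tokenizer lumps keywords and identifiers together as NAME,
            # so split them here (assumed labels, not a standard tag set).
            tag = "KEYWORD" if keyword.iskeyword(tok.string) else "IDENT"
        else:
            tag = tokenize.tok_name[tok.type]  # e.g. "OP", "NUMBER", "STRING"
        tagged.append((tok.string, tag))
    return tagged


def normalize(tagged):
    """Replace identifiers and literals with their tag: a rough stemming analogue."""
    return [tag if tag in ("IDENT", "NUMBER", "STRING") else text
            for text, tag in tagged]


if __name__ == "__main__":
    a = normalize(tag_tokens("total = price * 2\n"))
    b = normalize(tag_tokens("count = cost * 3\n"))
    print(a)       # ['IDENT', '=', 'IDENT', '*', 'NUMBER']
    print(a == b)  # True: both lines share one normalized form
```

Under this reading, two statements that differ only in naming reduce to the same "stem", which is roughly what you would want when renamed variables are used to disguise otherwise identical code. Whether that is the right analogue for malware analysis is exactly the open question.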
More questions than answers, but new tools all start that way.
I first saw this in a tweet by Alyona Medelyan.