5 Truths about AI in Cybersecurity – Truth #3: Good Training Data Can Be Hard to Get

Truth #2 described how linear classifiers produce lots of false positives and false negatives, and why more sophisticated data modeling is needed. What those models need in order to be properly trained is data, lots of data! That is the subject of this post.

Recall that supervised machine learning (ML) needs training data, what I earlier referred to as labeled data. It’s data that someone has categorized before feeding it into the AI system, labeling each example as benign or malicious. Having good training data, that is, data with accurate labels, is extremely important. And you need a lot of it to build good models.

To illustrate that point, here’s a very old example from a 2001 Microsoft research paper.

Michele Banko and Eric Brill
Scaling to Very Very Large Corpora for Natural Language Disambiguation
ACL, 2001

The graph in the figure shows the results of using four algorithms on an increasing volume of data, from 1 million words to 1 billion. What this shows is that no matter what algorithm is used, as I provide more data, all of them get better. The test accuracy for each of them goes up, and at about the same rate.

As long as you choose an appropriate algorithm given the nature of the data, it really doesn’t matter which one you use. If you give it a lot of great data, it will perform much better than if you don’t give it much data.
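To make the "more data, better accuracy" effect concrete, here is a minimal Python sketch of a learning curve. It does not reproduce the Banko and Brill experiment. It assumes a synthetic two-class data set (50 noisy features, each only weakly informative) and a simple nearest-centroid classifier; the data, the classifier, and every function name and parameter below are my own illustrative choices, not anything taken from the paper.

```python
import random


def make_samples(n, dim, shift, rng):
    """Return n synthetic (features, label) pairs.

    Label 1 stands in for "malicious", label 0 for "benign". Each feature
    is Gaussian noise centered at +shift or -shift depending on the label.
    """
    samples = []
    for i in range(n):
        label = i % 2  # alternate labels so every prefix contains both classes
        center = shift if label == 1 else -shift
        features = [rng.gauss(center, 1.0) for _ in range(dim)]
        samples.append((features, label))
    return samples


def train_centroids(samples, dim):
    """Compute the mean feature vector (centroid) of each class."""
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for features, label in samples:
        counts[label] += 1
        for j, value in enumerate(features):
            sums[label][j] += value
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def learning_curve(sizes, dim=50, shift=0.15, seed=0):
    """Train on growing prefixes of one training set; score on a fixed test set."""
    rng = random.Random(seed)
    train = make_samples(max(sizes), dim, shift, rng)
    test = make_samples(2000, dim, shift, rng)
    curve = []
    for n in sizes:
        centroids = train_centroids(train[:n], dim)
        correct = sum(predict(centroids, f) == y for f, y in test)
        curve.append((n, correct / len(test)))
    return curve


if __name__ == "__main__":
    for n, accuracy in learning_curve([10, 100, 1000, 10000]):
        print(f"{n:>6} training examples -> test accuracy {accuracy:.3f}")
```

With only a handful of examples the class centroids are dominated by noise, so test accuracy starts well below what the same algorithm reaches once it has seen thousands of examples. The algorithm never changes; only the amount of training data does.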

The insights from the Microsoft paper have been confirmed many times over the years, including by this now famous quote from Google’s Director of Research, Peter Norvig: “We don’t have better algorithms at Google, we just have more data.”

Malware Data Can’t Train Network Security AI

So, it’s clear that for supervised ML to operate at its best, it needs lots of data. I think the interesting question is, where do you get this data? That’s not obvious.

Good training data is hard to get, especially in the network security space. When you look at the endpoint vendors, for example, they are certainly training their ML systems, but they are doing so based on malware samples. They just have to get 100 million malware samples, which is not so difficult to do. There are large repositories, such as VirusTotal, that provide access to a huge volume of malware samples. Then, you can throw them into your AI system, and you can get good classifiers and good detection.

On the network side, however, that’s much harder. Where do you get examples of bad network traffic? How do you get command‑and‑control traffic, for example, or exploit traffic? One naive approach would be, “Well, I have millions of malware samples. I’m going to run all of them; I’ll just take all the network traffic that the malware produced in my sandbox.”

The problem is that when you detonate a piece of malware, it connects to bad sites but also to good sites. It does a lot of things that have nothing to do with its malicious traffic, that is, with the command‑and‑control, exploit or exfiltration activity. For example, it might connect to Google to check Internet connectivity. Or it might connect to legitimate mail servers to send spam. If you take everything that a malware sample connects to and use it as examples of malicious traffic, you will pollute your training sets significantly.
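The pollution problem can be shown with a small Python sketch. Everything in it is hypothetical: the flow records, the host names, the hand-assigned `truly_malicious` ground truth, and the allowlist are made up for illustration, and the allowlist filter is a deliberately crude mitigation of my own choosing, not a description of how any real product labels traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    dest_host: str
    dest_port: int
    truly_malicious: bool  # hypothetical ground truth, normally unknown


# Hypothetical network flows observed while detonating one malware sample.
SANDBOX_FLOWS = [
    Flow("www.google.com", 443, False),       # connectivity check
    Flow("mail.example.org", 25, False),      # legitimate mail server used for spam
    Flow("c2.badsite.test", 8080, True),      # command-and-control
    Flow("exfil.badsite.test", 443, True),    # exfiltration
]

# Hypothetical allowlist of well-known benign destinations.
ALLOWLIST = frozenset({"www.google.com"})


def label_everything_malicious(flows):
    """Naive approach: every flow the sandbox saw becomes a 'malicious' example."""
    return list(flows)


def drop_allowlisted(flows, allowlist=ALLOWLIST):
    """Crude mitigation: discard flows to allowlisted hosts before labeling."""
    return [flow for flow in flows if flow.dest_host not in allowlist]


def pollution_rate(malicious_examples):
    """Fraction of 'malicious' training examples that are actually benign."""
    if not malicious_examples:
        return 0.0
    wrong = sum(1 for flow in malicious_examples if not flow.truly_malicious)
    return wrong / len(malicious_examples)


if __name__ == "__main__":
    naive = label_everything_malicious(SANDBOX_FLOWS)
    filtered = drop_allowlisted(SANDBOX_FLOWS)
    print(f"naive labeling:  {pollution_rate(naive):.0%} of examples mislabeled")
    print(f"after allowlist: {pollution_rate(filtered):.0%} of examples mislabeled")
```

In this toy data set, half of the naively labeled "malicious" examples are really benign. The allowlist removes the connectivity check but still leaves the legitimate mail server mislabeled, which is why simple filtering alone is unlikely to produce clean labels.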

Also, in many cases, the malicious infrastructure is already down by the time you run your malware analysis, so you don’t even get a command‑and‑control connection. And how do you observe human‑driven lateral movement? How do you get training data en masse?

Many network security vendors have a difficult time getting a large amount of high‑quality, labeled network data. They might have a threat intelligence team that manually creates some data sets, but these typically fall far short of the volume needed, which results in weak AI models.

The Lastline Advantage

Lastline has an advantage here because of our network sandbox that provides us with a lot of high‑quality, labeled data that is produced every time we detonate a sample. We don’t need manual labeling. Our network sandbox provides deep visibility into network activity and can precisely label all activity. It understands which are the command‑and‑control connections, which are the lateral movement connections, which are the exploits, and what is noise. And we’ve been at it for over 10 years, so we have amassed a huge volume of this training data.

Our analysis system automatically separates the wheat from the chaff to give our AI models millions and millions of automatically‑labeled, high‑quality training examples. As a result, we can train our supervised ML algorithms much better than anyone else, resulting in much more accurate threat detection with minimal false positives.

Vendors that don’t have access to labeled network data or are relatively new at training their AI systems simply don’t have the volume of data that’s available to Lastline. Just as shown in the chart at the beginning of this post, their accuracy will be lower. So, be sure to ask prospective network security vendors how they get enough labeled network data (and not just malware samples) to train their models.

So far in this series, I’ve discussed the importance of supervised ML, the need for non-linear modeling, and the importance of huge volumes of training data. Once you have these in place, what’s needed to detect what is potentially malicious? You have to see the right signals in your data, which is the subject of my next post.


Earlier blog posts in this series:

5 Truths about AI in Cybersecurity – Truth #1: Anomalies Aren’t (Necessarily) Threats

5 Truths about AI in Cybersecurity – Truth #2: The World is Too Complex for Linear Classifiers

The post 5 Truths about AI in Cybersecurity – Truth #3: Good Training Data Can Be Hard to Get appeared first on Lastline.

Article Link: https://www.lastline.com/blog/5-truths-about-ai-in-cybersecurity-truth-3-good-training-data-can-be-hard-to-get/