TechRepublic’s Karen Roby spoke with Chris Ford, VP of product for Threat Stack, about supervised and unsupervised machine learning. The following is an edited transcript of their conversation.
Christopher Ford: Supervised and unsupervised learning are techniques that serve different use cases within machine learning. As your viewers know, machine learning is used to gain insights from data sets. You’re either organizing data or making predictions about data. I would say that the crucial difference between the two is that unsupervised learning is easier to get started with because it does not require labeled data.
In the machine learning world, labeled data is data that you, as a human, go through and describe to your machine learning system. Unsupervised learning does not require that. Generally, unsupervised learning is used to infer the structure of a data set that you give it. Unsupervised learning has roots in cybersecurity, which is my space, in doing anomaly detection. It uses clustering techniques to look at data and group it largely to answer the question, is this behavior that I’m looking at normal or is it anomalous.
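To make the clustering idea concrete, here is a minimal sketch of anomaly detection without labels. It is an illustration only, not Threat Stack's method: the feature pairs (logins per hour, megabytes transferred), the tiny k-means routine, and the distance threshold are all assumptions chosen for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means. Centroids start at the first k points so runs are repeatable."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids

def is_anomalous(point, centroids, threshold):
    """A point is 'unusual' if it is far from every cluster of normal behavior."""
    return min(math.dist(point, c) for c in centroids) > threshold

# Hypothetical unlabeled baseline: (logins per hour, MB transferred).
baseline = [(2, 10), (20, 200), (3, 12), (22, 210), (2, 11), (21, 205)]
THRESHOLD = 15.0  # assumed value; in practice this is tuned per environment

centroids = kmeans(baseline, k=2)
print(is_anomalous((3, 11), centroids, THRESHOLD))    # close to a known cluster
print(is_anomalous((60, 900), centroids, THRESHOLD))  # far from both clusters
```

Note that the model only answers "normal or unusual" relative to the baseline it was given. It says nothing about whether the unusual behavior is harmful.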
Supervised learning, on the other hand, is kind of like starting with the answer, in that it requires labeled data and lots of it. As it turns out, supervised learning algorithms are somewhat simpler than unsupervised ones. But the real challenge in using supervised learning is that there’s such a dearth of labeled data. You need a lot of data and you need it to be well labeled in order for supervised learning to work.
Supervised learning can be very powerful in that it allows you to do classification, and you can also use it to make predictions about data. I’d be happy to talk through some of the applications for unsupervised learning and supervised learning in cybersecurity. As I think we’ll soon discuss, making predictions about data is, we think, the next frontier in terms of identifying risk in your infrastructure.
Karen Roby: Talk a little bit further about machine learning and security.
Christopher Ford: Machine learning is not new to cybersecurity, first of all. It can be very powerful. I think since the late ’80s or early ’90s, unsupervised learning techniques have been used in a variety of applications like intrusion detection, whether it’s network-based or host-based. When applying unsupervised learning to those problems, essentially what you’re asking is, “Is this network connection or this user behavior good or bad?”
Good versus bad is a difficult question to answer. It’s more appropriate to say normal versus unusual or normal versus abnormal. Unsupervised learning was used for many, many years in those sorts of applications and still is. Supervised learning came into prominence as a tool for security practitioners in areas where classification is needed. It is used for things like URL filtering, identification of spam and antivirus, and it can be very effective in those use cases.
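As a rough illustration of classification from labeled data, the sketch below trains a small naive Bayes spam filter. The technique is a common textbook choice rather than anything named in the interview, and the four training messages and their labels are invented for the example. Real filters need far more labeled data.

```python
import math
from collections import Counter, defaultdict

def train(labeled_messages):
    """Count words per label from (text, label) pairs supplied by a human."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_messages)  # class prior
        denominator = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled data: the labels are the "answer" supplied up front.
training_data = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the quarterly report", "ham"),
]
model = train(training_data)
print(classify("claim your free prize", model))
print(classify("agenda for the monday meeting", model))
```

The algorithm itself is short. The hard part, as noted above, is obtaining enough well-labeled examples.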
Karen Roby: Chris, when we talk about best practices for incorporating machine learning into a bigger, overall strategy, what would that look like and what kind of advice can you pass on?
Christopher Ford: I’ll first start with the challenges I think that both of those technologies face and where I think we’re headed. Then I have some advice, practically speaking, for someone who wants to get started with some of these technologies. First off, machine learning is really meant to automate a lot of human-intensive processes. When answering the question good or bad, it’s often not clear what’s good or what’s bad.
If you’re talking about things like a virus or a connection, that can be more straightforward. But as infrastructure changes, as the way we develop software changes, the world has become incredibly complex and layered and very dynamic. You have workloads now that are up for a matter of seconds in some cases. It is that ephemeral nature and that complexity that makes it difficult to say, “This behavior is good,” or “This behavior is bad.”
Even answering the question, “Is this normal or not?” doesn’t really give you great insight into whether or not there’s an active threat or a risk. I like to say that one organization’s normal behavior could be considered quite bad for another organization, and something that’s unusual in one customer environment may not be harmful. Using unsupervised learning for anomaly detection is coarse-grained at this point.
You still end up with a lot of findings to comb through as a security analyst. That’s the real challenge. Supervised learning, on the other hand, as I said earlier, can be very effective in doing classifications, but the availability of good, labeled data at scale to train your models to identify certain behaviors just isn’t there yet. Where we at Threat Stack see the market going is toward combining those sorts of techniques, unsupervised learning and supervised learning.
Think of it like detection in depth. You hear people talk about “defense in depth.” This is detection in depth. Both of them have their strengths, but it’s really when you put them together that you can get something meaningful out of it. Remember I talked about the decision you’re making between good and bad, unusual or normal. What we see as the next layer in our detection in depth strategy is, “OK, was it predictable or not?”
If you see a behavior and you answer the question, “We could not have predicted that,” then that to us is a flag that there’s something extremely unusual, that isn’t normal for you and represents a significant amount of risk. We’re advocating a combination of detection mechanisms, classification, clustering and regression for doing predictions. Those predictions, they tell you, “Hey, is this behavior something that we reasonably could have predicted based on what we’ve seen already?”
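One way to picture the “was it predictable?” check is to fit a simple regression to past observations and ask whether a new value falls within the range the model would have expected. The sketch below is an assumption-laden illustration, not a description of any product: the hourly connection counts, the straight-line model, and the three-standard-deviation tolerance are all hypothetical choices.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def was_predictable(history, observed, tolerance=3.0):
    """True if `observed` is within `tolerance` residual std-devs of the forecast."""
    xs = list(range(len(history)))
    slope, intercept = fit_line(xs, history)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, history)]
    spread = statistics.pstdev(residuals)
    forecast = slope * len(history) + intercept  # prediction for the next step
    # Guard against a zero spread when the history is perfectly linear.
    return abs(observed - forecast) <= tolerance * max(spread, 1e-9)

# Hypothetical history: outbound connections per hour for one workload.
history = [100, 104, 109, 115, 118, 125, 129, 136]

print(was_predictable(history, 141))  # continues the existing trend
print(was_predictable(history, 400))  # a jump the model could not have predicted
```

A value that fails this check is not automatically malicious. It is a signal that can be layered on top of the clustering and classification results.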
If you’re looking to get started with all of this, I have some cautions and some recommendations. The caution, first, is be skeptical. Machine learning has a lot of buzz, and it’s well-earned, but machine learning often promises magic. I would be skeptical of solutions that promise to give you full detection and also lower the number of findings that you have to sift through in a day, because those things can be at odds sometimes. We like to say it’s like snipping the wires on your check engine light. You certainly won’t have that light bothering you, but it doesn’t mean there aren’t problems that you need to be looking at. Be skeptical.
But once you’ve said, “All right, I want to invest in machine learning as a way to identify risk”, then I would look, number one, for either solutions that are commercially available, or if you want to roll your own, think about combining detection mechanisms in a way that they work together. If you do have the inclination to invest in your own machine learning solution, I would say maybe rethink that first. There are plenty of good off-the-shelf solutions that have models already built that can leverage massive amounts of data that they’re collecting across tenants in their platform. That’s often a good starting place.
But if you want to invest in it on your own, I would say don’t forget about data engineering. We talk a lot about data science, because that’s, I think, a little bit more sexy. But data engineering is absolutely critical. If you want to do things like predictions and classifications at scale, you’ve got to make sure that you’ve got lots of data, that it’s well prepped for machine learning and that it’s labeled properly. Data engineering really forces you to identify, hey, what is my objective? What am I trying to get out of this?
The last thing I would say about either commercially available machine learning solutions or ones that you build yourself is that context really matters. Beware black box machine learning. Say you’re using deep learning to identify risk: if you don’t know why a model surfaces something, it’s really hard to go and investigate it. Choose models that are easily explainable so that you actually know why the technique or the technology is surfacing risk.
It is that transparency into how the model works that ultimately allows you to tune that model as well because every single organization is different. Look for solutions that allow you to take input from humans or learn over time so that you start to establish this virtuous cycle. The more data you capture, the more findings you generate, the more input you get from the people that are looking at those findings, the better your system gets over time.