Inform interview

Machine learning v malware: is big data the new kid on the cyber security block? - Miranda Mowbray

Data scientist and Networkshop keynote speaker Miranda Mowbray explains how finding patterns in large data sets may offer a huge step forward in tackling network attacks.


In this interview she also considers the particular security challenges posed by the Internet of Things, the ethical issues around big data analysis, and argues that we should not blame users for their poor password choices.


You use big data to find attacks on computer networks. How? And how does it improve on how people were doing it before?

In general, the traditional way of finding attacks on computer networks is to identify the “signature”. So, for a particular attack, you know that there is a particular sequence of bytes that’s a signature of that malware, or the signature is a little more behavioural, for instance that a certain number of bytes are sent to a particular port. Or the malware uses a particular domain to communicate between the infected machine and the malware.

But this is a rather fragile way of detecting, because if the malware designer manages to change the signature and slightly upgrade the attack, then it suddenly becomes invisible. Malware designers have found ingenious ways to design malware so that it gives a different signature every time - the malware itself mutates.

However, if you collect lots of data from the network and use data analysis on it, it’s sometimes possible to spot patterns. So, for example, although the domain will be different every time, you see patterns in the set of domains that are picked by the random generator within the malware. And if you can see this happening in ways that are consistent with the attack you can do larger-scale pattern spotting and use that to detect malware that mutates.

One way that we did this was to detect domain fluxing algorithms. This is a technique in malware where the domain that is used to connect between the malware controller and the infected machine is different each time. But there are patterns in the features of the domains that they try to connect to. We weren’t the first people to use data science to find these patterns, but we designed a detection method that, from five days' data from five different months in a large enterprise network, uncovered 19 families, nine of which we hadn't seen before. The previous record was in a paper that reported six previously unknown families detected in 18 months' data, and most previous research papers on this topic reported just one new family.
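The kind of per-domain feature spotting described here can be sketched in a few lines. This is only an illustration of the general idea, not the detection method Mowbray's team actually used; the features (entropy, length, digit ratio) and thresholds below are assumptions chosen for the example.

```python
import math
from collections import Counter

def domain_features(domain):
    """Character-level features of a domain's first label that tend to
    separate algorithmically generated names from human-chosen ones."""
    label = domain.split(".")[0].lower()
    counts = Counter(label)
    n = len(label)
    # Shannon entropy of the character distribution (bits per character).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    digit_ratio = sum(ch.isdigit() for ch in label) / n
    return {"length": n, "entropy": entropy, "digit_ratio": digit_ratio}

def looks_generated(domain, entropy_threshold=3.5, min_length=12):
    """Crude single-domain heuristic. A real detector would aggregate
    over many lookups before raising an alarm, to keep false positives low."""
    f = domain_features(domain)
    return f["entropy"] > entropy_threshold and f["length"] >= min_length

print(looks_generated("jisc.ac.uk"))                # short, low entropy
print(looks_generated("xj4k9q2mzp7w1r8vt5ny.com"))  # long, high entropy
```

In practice the signal comes from patterns across the whole set of failed or unusual lookups a host makes, not from any one domain in isolation.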

Are there any downsides with using data science for the detection of network security issues?

Yes. There are several big ones. The first is false positives. We’re looking at billions of events per day and if you have one chance in a thousand of getting a false positive, then that's millions of false positives. That’s not good enough – the number of false positives has to be kept low. There are various ways in which you can do this. Generally, you do it as a trade-off between false positives and false negatives in that you’re prepared to miss things, but there are also other techniques. For example, rather than immediately ban something from your network, you may quarantine it so that the stream connected with it cannot exploit the full power of your network, such as getting into more sensitive databases.
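The trade-off between false positives and false negatives comes down to where the alert threshold sits. A minimal sketch, using made-up suspicion scores rather than any real detector's output:

```python
def confusion(scores_benign, scores_malicious, threshold):
    """Count false positives and false negatives for a given alert threshold.
    Raising the threshold trades false positives for false negatives."""
    fp = sum(s >= threshold for s in scores_benign)
    fn = sum(s < threshold for s in scores_malicious)
    return fp, fn

# Hypothetical suspicion scores (0..1) assigned by some detector.
benign = [0.1, 0.2, 0.3, 0.55, 0.6]
malicious = [0.5, 0.7, 0.8, 0.9]

print(confusion(benign, malicious, 0.4))   # lenient: more false positives
print(confusion(benign, malicious, 0.65))  # strict: more false negatives
```

At billions of events a day, even a tiny false-positive rate multiplies into an unmanageable number of alerts, which is why the threshold is usually set to tolerate some misses.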

Another thing you can do is delay and collect more information. So you say, “this looks suspicious, we’re going to delay its behaviour, observe it some more and see whether we can be more sure whether it’s actually bad news or not”. That may result in a slight delay but, when we tried it, users didn’t actually notice anything at all.

Another option, and the point at which users might become aware, is if you decide to set up automatic send-outs of notifications to users saying “we’ve noticed something suspicious and here’s what you have to do”.

An inherent problem with cyber security from a data science perspective is that, unlike other areas of data science, there is an adversary. You’re not trying to find general patterns in nature or the universe or a business, there is someone who is actively trying to fool you – and that makes it more challenging and more interesting. You have to make the detection methods you develop expensive for your attackers to get around, or slow them down. You have to think always about how they might be circumvented. That’s pretty cool and fun.

Another issue is that, because you’re looking at a very large amount of potentially sensitive data, keeping this data private and secure is really important. That’s less of a factor in some other areas of data science, where the data is public and not particularly sensitive.

A further issue is the frequency of the true positives. Supposing one in a million of the billion events you look at is associated with an attack, that’s a thousand a day. Ringing an alarm bell for each one of those will not go down well with your security event team. So you have to collate the data and show it in a way that’s more helpful and easy to manage for human beings. If you can work out if there are things happening that are all associated with the same attack, you can report them together, ringing just one alarm bell.
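The collation idea can be sketched as grouping raw detections by a shared attack key and reporting one alert per suspected campaign. The event fields and keys below are invented for illustration:

```python
from collections import defaultdict

def collate(events):
    """Group raw detections by a shared attack key so the security team
    sees one alert per suspected campaign, not one alarm per event."""
    groups = defaultdict(list)
    for ev in events:
        groups[ev["attack_id"]].append(ev)
    return [{"attack_id": key, "count": len(evs),
             "hosts": sorted({e["host"] for e in evs})}
            for key, evs in groups.items()]

events = [
    {"attack_id": "dga-family-7", "host": "pc-14"},
    {"attack_id": "dga-family-7", "host": "pc-31"},
    {"attack_id": "dga-family-7", "host": "pc-14"},
    {"attack_id": "port-scan-2", "host": "srv-02"},
]
for alert in collate(events):
    print(alert)
```

Four raw events collapse into two alerts here; at the scale described in the interview, the same idea turns a thousand alarm bells a day into a handful.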

What added issues does the Internet of Things bring?

In one sense, the Internet of Things is no different from what we had before – it’s hardware plus software plus networking plus applications, and a security problem in any one of those is a security problem for the whole thing. But what’s unusual about the Internet of Things is that this is a new industry with a lot of new companies whose main focus is not going bust. They have a limited amount of time and venture capital to bring to market a product that actually functions and they are concentrating on that. Anything that will delay their time to market or increase their cost per device is going to be treated as an area where they may have to make cuts – and one of these areas is security and privacy. The issue with privacy is that it may be that, for the business model, the whole point of the device is to collect as much data as possible.

There’s also an issue that this is mainly small companies where the manufacturing chain may be very long and complex. It may involve teams from small startups in seven different parts of the world, where none of the teams have a security expert. And interactions between what any of these teams do might cause a security problem.

Another issue is where an object is designed for offline use and then put on the network. It may be very securely designed and be fine as long as it’s offline but, as soon as you hook it up, all sorts of new problems emerge. For example, there are doll manufacturers that are internet-enabling their dolls and there have been vulnerabilities discovered in these. They may be designed to be safe and work well as offline dolls but once you put them on the network there’s an issue. 

How can we move forward in making people actually care about security? Is it down to design processes, business models or user mindsets? (Or all three…)

It’s all three. For example, with the Internet of Things, as well as the design of the technology, one issue that can easily be addressed is that some Internet of Things startups don’t have good processes in place. They want to do the right thing but, for example, they do not have good procedures for responding to a vulnerability report.

There was an investigation into the security of baby monitors, which found vulnerabilities – not really surprising in itself – but what did surprise me was the very inadequate response of some of the companies to the vulnerability report. They didn’t seem to have any processes in place for dealing with it. There is an exception – Philips was exemplary, which shows that this can be done.

As for mindsets, there’s a terrific paper by Anne Adams and Angela Sasse at University College London called “Users Are Not the Enemy”, and the idea is “don’t blame the user”. When people talk about changing mindsets, sometimes what they mean is “those blasted users, they behave insecurely”. Adams and Sasse looked at some examples where this was being said and found that it was set up so that it was almost impossible for the users to behave securely!

I did some research with a different team on what makes people more likely to share online. We found that defaults are a very important driver. If you make things easy for people to do, then it’s more likely they will do them, particularly if you can make it the default. There are some examples in the Internet of Things where the default is insecure and the users have to do something to make it secure. That should not be the case. If a customer insists on using "password" as a password, there should be something that says, "we don’t allow that".
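Making the secure behaviour the default can be as simple as refusing known-weak choices at the point of entry. A minimal sketch, assuming a hypothetical denylist and length rule rather than any particular product's policy:

```python
# Hypothetical denylist; real systems check against large breached-password lists.
WEAK_PASSWORDS = {"password", "123456", "letmein", "qwerty"}

def check_password(candidate):
    """Return (ok, message). Refusing known-weak choices makes the
    secure behaviour the default rather than the user's problem."""
    if candidate.lower() in WEAK_PASSWORDS:
        return False, "we don't allow that"
    if len(candidate) < 12:
        return False, "too short: use at least 12 characters"
    return True, "ok"

print(check_password("password"))
print(check_password("correct horse battery staple"))
```

The point is where the burden falls: the system rejects the insecure option up front instead of leaving it to the user to know better.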

Generally, before you do user education, you should do everything else first. If you have a problem with users, it’s probably evidence that your design and your architecture is not good enough. Once you’ve improved that, then you can do user education.

What would you say is the balance between attack and defence? Who is winning?

I don’t think it’s a case of “winning”. I see the network ecosystem like a biological ecosystem. Most people who use any kind of network are good, but there are a few people who are out to exploit that: they're parasites. But it’s not in the interests of the parasites to kill off the host, to kill off the ecosystem.

We are always going to have people designing malware in order to make money, and it’s likely that they will continue to be able to make some money, but I don’t think that they are going to be able to close the whole thing down. I don’t think anyone can ever win, but I don’t think we’ll lose.

How might we go about understanding the 'thought process' of a machine learning algorithm?

There is an issue with some types of machine learning in that they are very opaque. So, for example, I could tell you all the features that are used in one of my algorithms and the weights attached to those features. But the weights depend on the recent data that’s come in, so tomorrow they may be different. I can give you a full description of the algorithm, I can say everything it does, but that’s not really explaining the thought processes. You don’t necessarily have an insight into what it’ll do tomorrow. And it may surprise you!

However, there is a lot of work being done in making machine learning less opaque, and I do think there is scope for it. For example, one thing you can do is find the top five features that are most salient in a classification algorithm. If I'm classifying something as malware, or not, or infected or not, I can tell you the features and their relative weights, and I can say how these have changed over time. I can also give you typical instances of where it has classified something as malware or not, or something that it wasn't quite sure about and it plumped for not-malware, and these were the weights that motivated that decision.
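The kind of explanation described, listing the most salient features and the weights behind one decision, is easy to sketch for a linear classifier. All feature names, weights and instances below are invented for illustration, not taken from Mowbray's algorithms:

```python
def top_features(weights, k=5):
    """Features most salient to a linear classifier, by absolute weight."""
    return sorted(weights, key=lambda f: abs(weights[f]), reverse=True)[:k]

def explain(weights, bias, instance):
    """Per-feature contributions to one classification decision."""
    contribs = {f: weights[f] * instance.get(f, 0.0) for f in weights}
    score = bias + sum(contribs.values())
    label = "malware" if score > 0 else "not-malware"
    return label, score, contribs

# Hypothetical weights for an illustrative malware classifier.
weights = {"entropy": 2.0, "num_subdomains": 0.8, "ttl": -0.5,
           "domain_age_days": -1.2, "digit_ratio": 1.5, "tld_rarity": 0.3}

print(top_features(weights, 3))
label, score, contribs = explain(
    weights, -1.0,
    {"entropy": 0.9, "domain_age_days": 0.1, "digit_ratio": 0.4})
print(label, round(score, 2))
```

Tracking how the top features and their weights shift over time, as the interview suggests, gives an operator some feel for what the model will do tomorrow, even if the full decision surface stays opaque.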

There's also some very nice work by Cynthia Rudin and her team on interpretable models: if you're learning from training data with 100 features, instead of looking for a model that uses the values of all 100 features, you can look for the best predictive model that only uses a small number of the features, and which evaluates an input just with a simple scoring system for these features. They've shown that for many applications you can find simple, easily explainable models of this kind that are just as accurate as the ones found by more opaque processes.
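An interpretable model of the kind described can be as simple as an additive point table over a handful of features, which a human can audit line by line. The features and point values below are invented for illustration and are not taken from Rudin's published models:

```python
def risk_score(flags, points):
    """Additive point system over a small set of binary features,
    in the spirit of interpretable scoring models."""
    return sum(points[f] for f in flags if f in points)

# Hypothetical, hand-auditable point table (illustrative only).
points = {"young_domain": 2, "high_entropy_name": 3,
          "rare_tld": 1, "no_dns_history": 2}

score = risk_score({"young_domain", "high_entropy_name"}, points)
print(score, "flag as suspicious" if score >= 4 else "ok")
```

Because every point in the total can be traced to a named feature, the model explains itself; the research question is finding point tables this simple that match the accuracy of opaque models.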

In the particular case of deep learning, which tends to be completely opaque, there is work being done to make it less so. Deep learning produces different abstraction layers for data, it finds features of a data set and features of those features and features of those features of those features. You can find out what high level features it's discovered, and that may give you some sort of insight into what it's doing.

There’s an international shortage of data scientists. How do we fill that gap - starting early with schools and STEM subjects? Encouraging more women into the field? Or will automation fill the gap for us?

Automation will be part of the solution but it is not enough. We can automate to a certain extent, and it's right that we do that, but there will be a continuing need for data scientists.

To do data science you need the technical bit, you need the hacking bit, but you also need the domain knowledge and so it really is an art as well as a science.

The appeal of the field has changed quite a lot in just the last few years because of the amount of publicity around how much money data scientists can make. One result of that is a bunch of people who do not have the maths background but are good at the hacking are getting into data science and that's not necessarily the ideal route, because it's easy to do it wrong. Data science involves a collection of skills that, from an educational point of view, we're not really educating people to have together.

As well as the analytical, statistical and hacking skills, there are domain skills, so a data scientist in agriculture might be rather different from a data scientist in business. But an absolutely crucial requirement is the ability to communicate your results clearly in a way that a layperson can understand but does not traduce the science. And that is something that, traditionally, computer scientists aren't given much coaching on.

What has your research told you about ethical issues in big data analysis (particularly in an educational context)?

My work has been in the corporate context but one thing that I've been impressed by is the seriousness with which universities take ethical issues around the guardianship of big data and how they do that better than the commercial world, in my opinion. So when I was looking at codes of practice I generally found better ones from academic institutions or scientific bodies than from the commercial world. Having an ethics board is absolutely normal in universities, it's rarer in industry.

I was doing research on more than one large network so, in theory, I had access to very large amounts of network data that hadn't been put through any obfuscation or anonymisation process. As an experiment, I got a colleague to pseudonymise some data that I was allowed to look at, so he replaced each identifier by a pseudonym, and then I had a look to see what I could find – and I found out some pretty sensitive personal things, some pretty sensitive corporate information, and it spooked me. My project already had a code of practice but it was a bit dusty so I did a complete overhaul.

More recently, there's been a framework for the ethical use of big data analysis brought out by the Home Office and the Cabinet Office and I gave input as part of the advisory board. They workshopped the first draft with members of the public so I answered questions from the workshop participants about data science. That was fascinating because it turned out the sorts of things that the members of the public were concerned about were different from what the experts were concerned about and it wasn't what we predicted.

For members of the public generally, what they cared about was what their data was being used for. They wanted it to be used for something of public benefit or personal benefit to them and they wanted some assurance that it would actually work, that it would be useful. That mattered more than the details of how we were looking after the data. They were ok with us doing things that would have been a big no-no for me, provided that it was in a good cause and would be effective. It was all very pragmatic, very sensible – and very hard to translate into technical rules.

Miranda Mowbray's research has included work on machine learning applied to computer network security, and ethics for big data analysis. She was previously at Hewlett Packard Enterprise Labs, finding new ways of analysing data to detect attacks on computer networks. She is a Fellow of the British Computer Society.


Miranda at Networkshop45

Miranda gave the Networkshop keynote presentation on 11 April 2017 at 14:45 on machine learning for network security.

See the programme for full details and session resources.