False positives are not all equal. Nor are they always real false positives!
Security tests ought to test for ‘false positives’. It’s important to see if a security product stops something good on a customer’s system, as well as the bad stuff.
Measuring the balance in security
Almost nothing in this world can be reduced accurately to ‘good’ or ‘bad’. There is too much subtlety: what’s good for one person is bad for another. Someone else might feel neutral about it, or slightly positive or negative. The same applies when testing security products. It’s rare to get a straightforward good/bad result.
An anti-malware product might block all threats but also all useful programs. It might ask the user frequent and unhelpful questions like, “Do you want to run this ‘unknown’ file?” Alternatively, it might let everything run quietly. Or prevent some things from running without warning or explanation. Maybe you want to see alerts, but maybe you don’t.
We look at how to put the nuance back into security testing.
In this article we’re covering how to classify sub-optimal results. False Positives (FPs) are a part of this, but they are not the full picture.
We will see how security products typically classify the objects that they encounter, and how they handle them. They might ask the user questions and delegate some of the responsibility. Alternatively, they might take action without producing an alert. There are pros and cons to many different approaches, and a security tester needs to be aware of them and account for them in a report. Knowing how to test for ‘false positives’ is extremely important.
Common classifications
Newcomers to security testing may assume that products are simply right or wrong when they encounter a file. If it’s malware then the product should detect it as malware. If it’s a legitimate application like Microsoft Word then the product should allow the user to continue uninterrupted. Surely that’s common sense?
A successful malware detection is called a True Positive (TP), while incorrectly detecting a good file as malware is a False Positive (FP).
You want TPs and not FPs. In other words, you want your anti-malware product to detect and block viruses but not PowerPoint.
You could also have a situation in which the product successfully categorises a good file (a True Negative) and fails to detect malware (a False Negative).
You can summarise these basic outcomes like this:
| Actual and Assigned Categories | Malware | Legitimate |
| --- | --- | --- |
| Classified as malware | True Positive | False Positive |
| Classified as legitimate | False Negative | True Negative |
In the ideal world there would only be True Positives and True Negatives. Arguably the worst case scenario is a False Negative, where your security product classifies malware as being legitimate. Your system is infected and you don’t know about it.
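To make those four outcomes concrete, here is a minimal sketch (in Python, with names of our own choosing) that maps a ground truth and a product verdict onto the table above. Real products rarely give such a clean yes/no verdict, as the rest of this article explains.

```python
def basic_outcome(actually_malware: bool, classified_as_malware: bool) -> str:
    """Map a (ground truth, product verdict) pair to one of the four basic outcomes."""
    if actually_malware and classified_as_malware:
        return "True Positive"    # malware correctly detected
    if actually_malware and not classified_as_malware:
        return "False Negative"   # malware missed: arguably the worst case
    if not actually_malware and classified_as_malware:
        return "False Positive"   # legitimate file wrongly condemned
    return "True Negative"        # legitimate file correctly allowed

# Example: a clean spreadsheet flagged as malware is a False Positive.
print(basic_outcome(actually_malware=False, classified_as_malware=True))
```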
If anti-malware products gave straightforward results like, “This is malware” and “This is legitimate” then we could stop this article now. But they usually don’t.
When a user downloads something that the anti-malware product is not 100% sure of, it may use a less definite classification such as:
- Risky (in some unspecified way)
- Rare (so risky)
- Not trusted (in some unspecified way)
- Harmful (this is practically equivalent to a ‘malware’ detection)
- Attempting to access the internet (without any useful context)
- Attempting to change system settings (without any useful context)
Classification and action/interaction ranges
As we’ve seen above, somewhere between “definitely malware” and “definitely safe” there is a range of classifications, spanning from ‘unknown’ to ‘unwanted’.
After the classification stage there is an action that occurs. Usually a malware detection will cause the program to be stopped, or blocked. A ‘safe’ judgement will allow the program to run, often without an alert – because telling users that a good program is good every time would be very annoying.
And that brings us onto the next subtlety. A security product success story isn’t just about accuracy. It’s about user experience too. Testing needs to take that into account or the ‘best’ products could be the ones that annoy the users the most.
User interaction is an important piece of the puzzle.
Now we have Classification, Action and possibly Interaction on the table.
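A tester’s record of each encounter might then look something like the sketch below. The field names are ours, purely for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Encounter:
    """One product response to one object, as a tester might record it."""
    classification: str                   # e.g. "Malicious", "Unknown", "Safe"
    action: str                           # e.g. "Blocked", "Allowed", "Asked user"
    recommendation: Optional[str] = None  # "Allow", "Block" or None when no advice is given

# A product that blocks malware outright, and one that hedges:
detection = Encounter(classification="Malicious", action="Blocked")
hedge = Encounter(classification="Unknown", action="Asked user", recommendation="Allow")
```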
When a product isn’t sure it tends to ask questions. The user then becomes the judge and jury (and potential victim of a malware attack or a blocked spreadsheet). The question is usually the same – do you want to allow/run this program? But the accompanying advice can vary. Generally you can boil it down to three options:
- Do you want to allow this program? We recommend you do.
- Do you want to allow this program? We recommend you don’t.
- Do you want to allow this program? We don’t have a recommendation.
There are now five likely outcomes: the product will allow the program, block it, or ask a question accompanied by one of the three recommendations above (allow, block or no advice either way).
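Enumerated in code, under the same assumptions as the sketch above, those five outcomes look like this:

```python
from enum import Enum

class Outcome(Enum):
    """The five likely product responses to a new program."""
    ALLOWED = "Allowed, usually silently"
    BLOCKED = "Blocked"
    ASKED_RECOMMEND_ALLOW = "Asked the user, recommending 'allow'"
    ASKED_RECOMMEND_BLOCK = "Asked the user, recommending 'block'"
    ASKED_NO_RECOMMENDATION = "Asked the user, with no recommendation"
```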
Putting it together
On one hand we have the classification that the product makes (‘this is malware’, ‘this is legitimate’ or ‘I’m not sure’). On the other we have the interaction with the user (‘what do you want to do now?’). This is a useful tool when working out how to test for false positives.
We can plot this out on a matrix as below:
When a security product reacts to a new program it will make a classification and perform some level of interaction. In the illustration above you can see how to record when a product encounters an ‘Unknown’ application and asks the user if it should be allowed or not, with the recommendation to ‘Allow’.
Of course, a tester should then identify if the application has really been allowed to run fully, or if it’s been somehow restricted. Similarly with a malware detection, if the product claims to have stopped it, the tester needs to verify that and not just take the product at its word.
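One way to check this (a hypothetical harness sketch, not a description of any particular lab’s tooling) is to build the test sample so that it writes a marker file as its final action, then look for that marker rather than trusting the product’s own report:

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical marker path; the sample writes this file as its very
# last action, so its presence proves a complete, unrestricted run.
MARKER = Path(tempfile.gettempdir()) / "sample_ran_to_completion.txt"

def really_ran(exe_path: str, timeout: int = 60) -> bool:
    """Launch a sample and confirm it executed fully. A missing marker
    means it was blocked, killed or restricted part-way through."""
    MARKER.unlink(missing_ok=True)  # start from a clean slate
    try:
        subprocess.run([exe_path], timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        pass                        # a suspended or hung sample counts as restricted
    return MARKER.exists()
```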
How SE Labs tests with legitimate software
SE Labs uses this method of measuring performance with legitimate applications. We rarely see genuine FPs, in which a product falsely condemns a legitimate application as malware. Usually we see what we call a Non-Optimal Classification or Action (NOCA). This is what we measure on a chart like that above and score accordingly.
Here’s an example from our 2021 Q4 enterprise endpoint protection test:
In the above data we would look at the Classification and the Action/Interaction and work out the score. Allowing a Safe object scores two points. If it’s a very prevalent file then we multiply the score to reach the final rating. If it’s quite a rare file then the multiplier is lower.
Similarly, marking a prevalent legitimate file as Malicious and blocking it outright would bring a penalty of minus two with a large multiplier.
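As a rough sketch of that arithmetic: the plus-two and minus-two values come from the examples above, but the intermediate point value and the multiplier figures here are illustrative assumptions, not SE Labs’ published weightings.

```python
# Points per Classification/Action combination for a legitimate file.
POINTS = {
    ("Safe", "Allowed"): 2,        # optimal outcome
    ("Unknown", "Asked user"): 0,  # hedging: a NOCA rather than a true FP (assumed value)
    ("Malicious", "Blocked"): -2,  # a genuine false positive
}

# Prevalence multipliers: condemning a file that everyone uses hurts
# more than condemning an obscure one. These figures are assumptions.
PREVALENCE_MULTIPLIER = {"very prevalent": 5, "common": 3, "rare": 1}

def legitimate_rating(classification: str, action: str, prevalence: str) -> int:
    """Combine Classification/Action points with a prevalence multiplier."""
    return POINTS[(classification, action)] * PREVALENCE_MULTIPLIER[prevalence]

print(legitimate_rating("Safe", "Allowed", "very prevalent"))       # 10
print(legitimate_rating("Malicious", "Blocked", "very prevalent"))  # -10
```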
False positives are a subset of Non-Optimal Classifications or Actions.
For nerds: FP ⊆ NOCA
It’s all very well knowing how to test for ‘false positives’ but what about all of the other possibilities?
Classification semantics
In our scoring table above we have six different classifications, but every anti-malware product uses slightly different language. For that reason we use the following internal guide to help us stay consistent when testing different products.
| Alert | Classification | Why? |
| --- | --- | --- |
| Risky | Object is Unwanted | We don’t care about the word “rare”, just the word that describes how desirable the file is (“Risky”, in this case). |
| Rare (so risky) | Object is Unwanted | As above. |
| Rare | Object is not Classified | As above. |
| Not trusted | Object is Suspicious | Things are suspicious when you don’t trust them. |
| Harmful | Object is Malicious | Any description that directly claims the software is somehow “dangerous” counts as malicious. |
| Threat (e.g. Generic_Virus_RAT) | Object is Malicious | Whether it’s a malware definition or adware, for our purposes it’s malicious. |
| Program.exe is (doing something) | Object is not Classified | It might be “acting as a server”, “wants internet access” or could even be “trying to access your online bank”! We don’t care! If the security product makes no judgment on the classification then we don’t either. |
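A tester could encode that guide as a simple lookup, falling back to ‘not Classified’ whenever the product passes no judgment. The keyword matching below is deliberately naive and the strings are simplified stand-ins, not real product wording:

```python
# Map the gist of an alert's wording to our internal classification.
ALERT_CLASSIFICATION = {
    "risky": "Unwanted",         # also catches "Rare (so risky)"
    "not trusted": "Suspicious",
    "harmful": "Malicious",
    "threat": "Malicious",
}

def classify_alert(alert_text: str) -> str:
    """Return the internal classification for a product's alert, or
    'not Classified' when the product makes no judgment of its own."""
    text = alert_text.lower()
    for keyword, classification in ALERT_CLASSIFICATION.items():
        if keyword in text:
            return classification
    return "not Classified"

print(classify_alert("Threat detected: Generic_Virus_RAT"))  # Malicious
print(classify_alert("Program.exe is acting as a server"))   # not Classified
```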
Other things to consider
Our general approach in this article does not cover other important questions such as:
- Which files and other objects (URLs, etc.) should you choose?
- How do you introduce those objects into the test?
- How do you judge prevalence of those objects in the real world?
- How do you verify that your ‘good’ objects are completely harmless?
All of these are important considerations and each tester needs to at least explain how they addressed them, even if the explanation isn’t perfect. Only then can a consumer of their reports understand what’s been done and how useful that information is to them.
We hope that this article about how to test for ‘false positives’ inspires anyone interested in security testing to take a rounded view of how products work, and not just a pass/fail, detect/didn’t-detect perspective.