It’s been about two weeks since we did the AntiVirus FightClub and it’s been an interesting experience. (background: here, here, and here) Below are some of my thoughts and conclusions. (flamesuit on!)
Real-world performance
Some within the AV industry questioned the results of the testing, often citing the size of the in-the-wild test set, among other things. Herein may lie the reason many have found that the public antivirus tests don’t reflect real-world performance. Allow me to explain with a rather silly analogy.
If I have an ant problem in my house (which I do), I could get some ant spray and try to kill them. Suppose I go to the store and find two ant spray brands. One claims to kill 97% of all ant species, even the ones genetically engineered in the ant lab, in a test of over 100,000 species. The other claims to kill only 75%. I’d choose the 97% brand. Next, I discover that the 97% brand only kills 3 of the 6 ant species that live in my area, but the 75% brand kills 5 of the 6. Clearly the test results have led me to buy a brand that isn’t best for me, even though those tests may be totally valid. The problem is that the test covered 100,000 species, of which only 6 really mattered to me.
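To put numbers on the analogy, here is a minimal sketch in Python. The brand names are placeholders I made up; the percentages and species counts are the ones from the analogy above.

```python
# Illustration of the ant-spray analogy. "brand_a" and "brand_b" are
# hypothetical names; the figures are taken from the analogy itself.

def detection_rate(detected: int, total: int) -> float:
    """Fraction of the test population that was killed (detected)."""
    return detected / total

# Rates advertised from the large coverage test (over 100,000 species).
advertised = {"brand_a": 0.97, "brand_b": 0.75}

# Only 6 species live in my area; count how many each brand kills.
local_total = 6
local_killed = {"brand_a": 3, "brand_b": 5}
local = {brand: detection_rate(n, local_total) for brand, n in local_killed.items()}

# The ranking flips depending on which population is measured.
best_advertised = max(advertised, key=advertised.get)
best_local = max(local, key=local.get)

print(f"best on the big test: {best_advertised}")
print(f"best in my area:      {best_local}")
```

Both numbers can be correct at the same time; they simply measure different populations.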
Many would point out that this metaphor is flawed: viruses aren’t ants. Unlike with ants, you can’t predict which viruses you will encounter in your area, so coverage tests serve as predictors of how well your virus scanner will behave on the viruses you do encounter. However, as the FightClub showed, the performance on viruses that came to our email honeypot was not at all predicted by these coverage tests. I have many hypotheses for why we’re seeing this; however, given the lack of transparency in the public coverage tests, it is hard to draw any conclusions (a simple Excel or CSV file of results would help).
I’m not proposing that the AntiVirus FightClub is a better test than these coverage tests; it’s just evidence that the real-world performance of some of these scanners isn’t well predicted by coverage tests, and in some cases is worse than the larger public believes.
A different test methodology
I think a better predictor of real-world performance would be past real-world performance. This can be measured automatically, in real time, with humans needed at most for the later verification step. To use my silly analogy, this would be the equivalent of setting up an ant trap and testing each ant that walks in the door on all the different ant sprays and recording the results.
The implementation is simple: set up an email honeypot that feeds every email into all scanners and records the results. Each email (sample) is then placed on a queue to be verified. If it is later verified as malware, it is counted in the results; otherwise it is not. Samples could be verified by a human (or a set of humans) or by consensus among scanners: if x scanners deem it to be malware a few weeks later, it could be counted.
This is almost identical to a test run here among other tests, except that samples are drawn from sources that are more representative of average internet users. If you want real-world performance you can only test ants that walk in the ant trap; you can’t seek out new ant species and throw them in the test. The code for the test also needs to be open source so several people can run the same test independently.
There are flaws in this test. It only tests viruses that come in by email, which is only one vector of infection with a different population of malware than other infection vectors. Anyone who knows the email address or domain of the honeypot can poison the test, but this can be countered by open sourcing the code and allowing multiple parties to independently run the same tests.
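The procedure described above could be sketched roughly as follows. This is not the FightClub code, just a minimal illustration under stated assumptions: each scanner is modelled as a plain function from email bytes to a verdict, the `Honeypot` and `Sample` names are made up for this example, and the "a few weeks later" rescan is simulated by updating toy signature sets.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a scanner is any function from raw email bytes to a verdict
# (True = flagged as malware). Real products would be wrapped behind this.
Scanner = Callable[[bytes], bool]

@dataclass
class Sample:
    data: bytes
    first_verdicts: Dict[str, bool]   # what each scanner said on arrival
    verified_malware: bool = False    # set later by the verification step

@dataclass
class Honeypot:
    scanners: Dict[str, Scanner]
    queue: List[Sample] = field(default_factory=list)

    def receive(self, email: bytes) -> None:
        """Scan an incoming email with every scanner and queue it for verification."""
        verdicts = {name: scan(email) for name, scan in self.scanners.items()}
        self.queue.append(Sample(email, verdicts))

    def verify_by_consensus(self, x: int) -> None:
        """Rescan queued samples later; count as malware if at least x scanners agree."""
        for sample in self.queue:
            hits = sum(scan(sample.data) for scan in self.scanners.values())
            sample.verified_malware = hits >= x

    def results(self) -> Dict[str, float]:
        """Per-scanner detection rate at arrival time, over verified malware only."""
        verified = [s for s in self.queue if s.verified_malware]
        if not verified:
            return {name: 0.0 for name in self.scanners}
        return {
            name: sum(s.first_verdicts[name] for s in verified) / len(verified)
            for name in self.scanners
        }

# Toy demo: signature sets stand in for real scanners.
sigs_a, sigs_b, sigs_c = {b"evil1"}, {b"evil1", b"evil2"}, set()
hp = Honeypot({
    "a": lambda d: d in sigs_a,
    "b": lambda d: d in sigs_b,
    "c": lambda d: d in sigs_c,
})
for mail in (b"evil1", b"evil2", b"hello"):
    hp.receive(mail)

# "A few weeks later" the vendors have caught up; then verify by consensus.
sigs_a.add(b"evil2")
sigs_c.update({b"evil1", b"evil2"})
hp.verify_by_consensus(x=2)
rates = hp.results()
print(rates)
```

The point of the design is that scoring uses the verdicts recorded on arrival, while the later rescan only decides which samples count.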
As always, I’m interested to know what people think of this.
Lessons from AV FightClub
The biggest observation is that the world is quite split on how it views antivirus testing and performance. Of the people I’ve had contact with, one group found this whole experiment unsurprising. Many had run similar experiments or just believed from personal experience that real-world antivirus performance had no similarity with the public coverage test results (mostly around ClamAV). The other group found this test flawed on many levels and discarded it because of the methodology or my own objectivity. The surprising thing was that there seems to be little community desire to unify the two groups by discovering where the root differences lie.
The other lesson I learned was that the method for interpreting the results should also be posted with the results. Inevitably, vendors who don’t do well will believe the test flawed and some people will still misinterpret the results, but a best effort should be made. While I made every effort to communicate the purpose and scope of the AV FightClub, it was clearly still miscommunicated.
10 Responses on AntiVirus FightClub conclusions
> Allow me to explain with a rather silly analogy.
Yes, it is a little silly. It assumes that your test is as valid as anyone else’s. Since you’ve carefully avoided answering any awkward questions about your methodology, I guess even you don’t think that’s the case.
Having had a substantial amount of experience working with real-world security problems for clients and various solutions, I can’t say I see a lot of value in any of the testing methods I’ve seen, in particular those like the PC Mag “shootout” cited in another post, which is also the most common. I have several issues with *all* these tests:
First, there is no weighting for probability. An infection floating about in the wild in small numbers which is highly unlikely to reach my client should not be considered an equal threat to Storm. Yes, ideally I want any and all tools to defend against any and all infections, but while signature-based defense is still primary and heuristics and intrusion detection are still immature, it is more important to be able to defend against those infections with the highest probability of attack.
Second, and more importantly, there is no weighting for severity. There are still thousands of infections in the wild which are at most a small nuisance. And again, while I expect my tools to defend against such, in a numerical test it is absurd to score detection of the innocuous as equal to defending against a botnet trojan or rootkit. Clients have paid me a lot of $ for removing that infection which fell into the 3% their AV tool couldn’t detect. What good is it to have a high score against the many weak infections but be unable to protect against the few which are extremely destructive?
Third, and a real serious issue, is that of removal as opposed to just detection. I began dealing with this when polymorphic trojans emerged. Of course, signature-based engines are often unable to detect these types of infections, as well as rootkits. But even assuming detection, the multiple and amorphous nature of the infection’s installation makes actually removing the infection extremely difficult. Consequently, defenses must be multi-layered and capable of before-the-fact prevention, hence host and network intrusion detection, permissions profiling (à la SELinux & AppArmor), sandboxing, etc.
I’ve just begun looking at the Untangle suite. I’ve found ClamAV to be an adequate file-signature-based defense on the common email and file download attack vectors, but AFAIK it cannot detect polymorphisms like Kaspersky can. I do see that Untangle includes both network and host (at the gateway) intrusion detection based on behavioral patterns (“signatures” is a misnomer here). However, it would appear that client-based protection tools are still needed, particularly on Windows machines. Unfortunately, the strongest tools of this type are only available on Linux.
Hi Dirk,
I actually think that ‘response times’ are as important as ‘detection rates’, if not more.
The skill and speed of the AV researchers’ response team is critical nowadays.
David
Looks like Commtouch implemented something similar to what is described here:
http://www.commtouch.com/Site/ResearchLab/VirusLab/recent_activity.asp
While the method you tested may be elementary in the view of some critics, there is a counter view: “If these antivirus products cannot pass the elementary tests, can they pass the tough, big real-world tests?”
Anyway, you are not hiding your methodology; you are as open as open source. Good work! Keep it up!! Bring back the fight club again! Get a team and let the members pitch their own products and their configurations!!
Hi, this would be nice if some of the free ones were included, and Trend Micro. Would also like to see total throughput time with all the bells and whistles vs. other firewalls (Smoothwall, Endian…)
In our throughput testing done on the same hardware, the Untangle average was slower than the others because of all the checking it has to do. When you combine multiple tasks on one box the throughput usually slows. I want to see which ant goes faster: fire, red, black…
This test is really about the AV, not about firewall performance. You can get those AV scanners without Untangle; Clam comes on most open source firewalls now. Nothing special.
So: how about some “real world” performance on the firewall, not the AV? Untangle makes me think of the all-in-one printer, fax, scanner, TAD and coffee maker.
Well congrats for your good experience, Your last post was very strong, and conclusion is really superb. I had such problems with my older anti spyware softwares and that was really headache. But from when I got http://www.search-and-destroy.com software, It is really good to manage.
Have a good day my Friend!
Completely agree with you about posting your methodology with your results. That will keep away a lot of skepticism.
Excellent conclusion.
I find Spybot Search and Destroy to be good at spyware removal; it’s free and can immunise your PC before it gets infected. Nice informative post you have written there. Added you to my RSS reader.