It’s been about two weeks since we did the AntiVirus FightClub and it’s been an interesting experience. (background: here, here, and here) Below are some of the thoughts and conclusions. (flamesuit on!)


Real-world performance

Some within the AV industry questioned the results of the testing, often citing the size of the in-the-wild test set, among other things. Herein may lie the reason many have found that the public antivirus tests don’t reflect real-world performance. Allow me to explain with a rather silly analogy.

If I have an ant problem in my house (which I do), I could get some ant spray and try to kill them. Say I go to the store and find two ant spray brands. One claims to kill 97% of all ant species, even the ones genetically engineered in the ant lab, based on a test of over 100,000 species. The other claims to kill only 75%. I’d choose the 97% brand. Next, I discover that the 97% brand kills only 3 of the 6 ant species that live in my area, but the 75% brand kills 5 of the 6. Clearly the test results have led me to buy a brand that isn’t best for me, even though those tests may be totally valid. The problem is that the test covered 100,000 species, of which only 6 really mattered to me.

Many would point out that this metaphor is flawed: viruses aren’t ants. Unlike ants, you can’t predict which viruses you will encounter in your area, so coverage tests serve as predictors of how well your virus scanner will behave on the viruses you do encounter. However, as the FightClub showed, performance on the viruses that came to our email honeypot was not at all predicted by these coverage tests. I have many hypotheses about why we’re seeing this, but given the lack of transparency in the public coverage tests it is hard to draw any conclusions (a simple Excel or CSV file of results would help).

I’m not proposing that the AntiVirus FightClub is a better test than these coverage tests; it’s just evidence that the real-world performance of some of these scanners isn’t well predicted by coverage tests, and in some cases is worse than the larger public believes.

A different test methodology

I think a better predictor of real-world performance would be past real-world performance. This can be measured automatically, in real-time, without involving humans. To use my silly analogy, this would be the equivalent of setting up an ant trap and testing each ant that walks in the door on all the different ant sprays and recording the results.

The implementation is simple: set up an email honeypot that feeds every email into all scanners and records the results. Each email (sample) is then placed on a queue to be verified. If it is later verified as malware, it is counted in the results; otherwise it is not. Samples could be verified by humans (or a set of humans) or by consensus among scanners: if x scanners deem it to be malware a few weeks later, it could be counted.
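To make the scoring step concrete, here is a rough sketch in Python of how the consensus variant might be tallied. It is not the code used for the FightClub. The `Sample` class, the scanner names, and the `min_consensus` parameter (the "x" above) are all made up for illustration, and it assumes the honeypot has already recorded each scanner's verdict at arrival and again a few weeks later.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """One email from the honeypot (hypothetical record layout)."""
    sample_id: str
    # scanner name -> True if it flagged the email when it arrived
    initial_verdicts: Dict[str, bool]
    # scanner name -> True if it flags the email at verification time
    later_verdicts: Dict[str, bool] = field(default_factory=dict)


def is_verified_malware(sample: Sample, min_consensus: int) -> bool:
    """Verified if at least min_consensus scanners flag it later on."""
    return sum(sample.later_verdicts.values()) >= min_consensus


def detection_rates(samples: List[Sample], min_consensus: int) -> Dict[str, float]:
    """Fraction of verified samples each scanner caught on arrival."""
    verified = [s for s in samples if is_verified_malware(s, min_consensus)]
    if not verified:
        return {}
    detected = defaultdict(int)
    for s in verified:
        for scanner, flagged in s.initial_verdicts.items():
            detected[scanner] += int(flagged)
    return {name: count / len(verified) for name, count in detected.items()}


# Made-up data: mail-2 is never verified, so it is left out of the results.
samples = [
    Sample("mail-1", {"scanner_a": True, "scanner_b": False},
           {"scanner_a": True, "scanner_b": True}),
    Sample("mail-2", {"scanner_a": False, "scanner_b": False},
           {"scanner_a": False, "scanner_b": False}),
    Sample("mail-3", {"scanner_a": True, "scanner_b": True},
           {"scanner_a": True, "scanner_b": True}),
]
rates = detection_rates(samples, min_consensus=2)
print(rates)
```

Only samples that pass verification count toward the denominator, so ordinary spam or clean mail landing in the honeypot does not affect any scanner's score. A human-verified variant would only need a different `is_verified_malware`.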

This is almost identical to a test run here, among other tests, except that samples are drawn from sources that are more representative of an average internet user. If you want real-world performance you can only test ants that walk into the ant trap; you can’t seek out new ant species and throw them in the test. The code for the test also needs to be open source so several people can run the same test independently.

There are flaws in this test. It only tests viruses that come in by email, which is only one vector of infection, with a different population of malware than other infection vectors. Anyone who knows the email address or domain of the honeypot can poison the test, but this can be countered by open sourcing the code and allowing multiple parties to independently run the same tests.

As always, I’m interested to know what people think of this. :)

Lessons from AV FightClub

The biggest observation is that the world is quite split on how it views antivirus testing and performance. Of the people I’ve had contact with, one group found this whole experiment unsurprising. Many had run similar experiments or just believed from personal experience that real-world antivirus performance bore no similarity to the public coverage test results (mostly around ClamAV). The other group found this test flawed on many levels and discarded it because of the methodology or doubts about my objectivity. The surprising thing was that there seems to be little community desire to unify the two groups by discovering where the root differences lie.

The other lesson I learned was that the method for interpreting the results should also be posted with the results. Inevitably, vendors who don’t do well will believe the test is flawed and some people will still misinterpret the results, but a best effort should be made. While I made every effort to communicate the purpose and scope of the AV FightClub, it was clearly still miscommunicated.