I think we can all agree that Yahoo has really had an off decade (or so). Most recently, reports revealed that, basically, Yahoo's security mechanism was at best an honor system and at worst a giant fraud. This is only the latest major uh-oh in a string of them.
The crazy thing is that most cracking instances are either the result of not keeping up with patches or boneheaded programming errors that allow code injection, SQL injection, and cross-site scripting. This happens over and over and over again. Avoiding these problems is easy: All you need are good coding and QA practices.
But many organizations don’t do that. Instead they move development to “low-cost countries” and treat attacks as a sort of rare, 100-year weather event they can’t avoid or afford to mitigate. Thus, it happens over and over again.
In the last few years, some attempts have been made to use machine learning tools to “predict” vulnerabilities in code. The results so far have been hit and miss. Microsoft Research went so far as to write a paper explaining why it doesn’t bother using vulnerability prediction models at all.
Notice how much of this research focuses on analyzing the code itself. This amounts to a lot of complexity measures, code feature analysis, and churn analysis (how often the code changes). Yet as a developer, I know which of the crap places I’ve worked are likely to be pwn3d.
When it comes to websites, who is likely to be cracked? It comes down to the following: Do they have a million managers, a bunch of people with four-year degrees from two-year trade schools, no or poor processes, or anything resembling the waterfall model? Do they think software, particularly an operating system, is like fine wine and gets better with age? Another tell: If they have gone through the pointless exercise of determining their Capability Maturity Model level, then they probably hire crap developers who write crappier code. The best developers have left ... efficiently.
If on top of everything else there's a reason to crack the website -- there's something valuable to steal or the organization's embarrassment will entertain malicious hackers -- then a successful attack is pretty much inevitable.
All sorts of problems certainly seep into code and its structure, but analyzing that code for the likelihood of flaws is hard. Even if you find a vulnerability, determining whether it's exploitable or likely to be exploited is harder. I think a better approach is to account for the bigger attributes along with the code analysis.
So, dear researchers, I give you my top 19 attributes to include in your analysis:
Internal or outsourced development
Average years of experience of development team
Uses source control
Uses static code checking
Test coverage percentage
Static code checking tool
Test coverage tool
Number of issues filed
Number of issues closed in last 90 days
Security certifications in alphabetical order
Average duration on team (turnover)
Language of internally developed software
Lines of code
Cyclomatic complexity measurement
Commits in the last 90 days
Lines changed in the last 90 days
Number of users
Google news mentions
Revenue/transaction value handled by application
Put that through your fraud detection algorithm instead of the usual junk because most terrible software development shops excel not at creating software, but at creating bogus numbers and reports. After you kick out the organizations that say they do everything well 100 percent of the time -- my basic fraud detection -- run this through any classification algorithm with known baddies and out should come Yahoo and friends.
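To make that two-step idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the ShopProfile fields are a small subset of the attributes above, the rule for "says they do everything well 100 percent of the time" is my own guess at what such a filter might check, the sample shops are invented, and k-nearest neighbors simply stands in for "any classification algorithm."

```python
from dataclasses import dataclass
from math import dist


@dataclass
class ShopProfile:
    """Hypothetical self-reported answers for one development shop."""
    name: str
    outsourced: bool
    avg_years_experience: float
    uses_source_control: bool
    uses_static_checking: bool
    test_coverage_pct: float
    issues_filed: int
    issues_closed_90d: int


def too_good_to_be_true(p: ShopProfile) -> bool:
    """Assumed fraud rule: every practice claimed, 100% coverage, no open issues."""
    return (
        p.uses_source_control
        and p.uses_static_checking
        and p.test_coverage_pct >= 100.0
        and p.issues_closed_90d >= p.issues_filed
    )


def features(p: ShopProfile) -> list[float]:
    """Scale a few attributes to roughly 0..1 so no single one dominates distance."""
    return [
        1.0 if p.outsourced else 0.0,
        min(p.avg_years_experience, 20.0) / 20.0,
        1.0 if p.uses_source_control else 0.0,
        1.0 if p.uses_static_checking else 0.0,
        p.test_coverage_pct / 100.0,
    ]


def predict_breached(labeled, candidate: ShopProfile, k: int = 3) -> bool:
    """Majority vote among the k labeled shops closest to the candidate."""
    nearest = sorted(
        labeled, key=lambda pair: dist(features(pair[0]), features(candidate))
    )[:k]
    votes = sum(1 for _, breached in nearest if breached)
    return votes * 2 > len(nearest)


# Invented "known baddies" (True) and shops with no known breach (False).
history = [
    (ShopProfile("a", True, 2, False, False, 5, 400, 20), True),
    (ShopProfile("b", True, 3, True, False, 10, 300, 40), True),
    (ShopProfile("c", False, 4, True, False, 20, 250, 60), True),
    (ShopProfile("d", False, 10, True, True, 80, 120, 100), False),
    (ShopProfile("e", False, 12, True, True, 70, 90, 80), False),
    (ShopProfile("f", False, 8, True, True, 65, 150, 110), False),
]

submissions = [
    ShopProfile("perfect", False, 15, True, True, 100, 50, 50),
    ShopProfile("shaky", True, 2.5, True, False, 8, 350, 30),
    ShopProfile("solid", False, 9, True, True, 75, 100, 90),
]

# Step 1: kick out the shops whose answers are implausibly perfect.
credible = [p for p in submissions if not too_good_to_be_true(p)]

# Step 2: classify whoever is left against the labeled history.
predictions = {p.name: predict_breached(history, p) for p in credible}
print(predictions)
```

A real study would need far more labeled organizations, all of the attributes, and a properly validated model; the point here is only the order of operations: filter the bogus reports first, then classify.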
Complexity and churn are only two factors in vulnerability. Who cares enough to attack a website is another factor (beyond basic script-kiddie attacks). But ultimately, understanding the team that developed the software and how they developed it is probably more predictive. If we find a good way to turn machine learning on those metrics, I suspect we can predict who is likely to be successfully attacked next.