I first read this post because of the claim that 50% of the code base at Google changes each month. So it says, but perhaps more on that another day.
While reading the post I ran across the following:
In order to help identify these hot spots and warn developers, we looked at bug prediction. Bug prediction uses machine-learning and statistical analysis to try to guess whether a piece of code is potentially buggy or not, usually within some confidence range. Source-based metrics that could be used for prediction are how many lines of code, how many dependencies are required and whether those dependencies are cyclic. These can work well, but these metrics are going to flag our necessarily difficult, but otherwise innocuous code, as well as our hot spots. We’re only worried about our hot spots, so how do we only find them? Well, we actually have a great, authoritative record of where code has been requiring fixes: our bug tracker and our source control commit log! The research (for example, FixCache) indicates that predicting bugs from the source history works very well, so we decided to deploy it at Google.
How it works
In the literature, Rahman et al. found that a very cheap algorithm actually performs almost as well as some very expensive bug-prediction algorithms. They found that simply ranking files by the number of times they’ve been changed with a bug-fixing commit (i.e. a commit which fixes a bug) will find the hot spots in a code base. Simple! This matches our intuition: if a file keeps requiring bug-fixes, it must be a hot spot because developers are clearly struggling with it.
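That ranking heuristic is simple enough to sketch. Below is a minimal, illustrative Python script that ranks files by how many bug-fixing commits touch them. It assumes bug fixes can be spotted by keywords in the commit subject (my assumption for illustration; the Google post cross-references its bug tracker instead), and the `BUGFIX_PATTERN` and `rank_hot_spots` names are mine, not from either post.

```python
import re
import subprocess
from collections import Counter

# Hypothetical keyword heuristic for spotting bug-fixing commits;
# Google's version consults its bug tracker rather than commit messages.
BUGFIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug)\b", re.IGNORECASE)

def rank_hot_spots(repo_path=".", top_n=20):
    """Rank files by how often they appear in bug-fixing commits."""
    # One NUL-prefixed subject line per commit, followed by the changed files.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:%x00%s"],
        capture_output=True, text=True, check=True,
    ).stdout

    counts = Counter()
    for entry in log.split("\x00")[1:]:      # one entry per commit
        lines = entry.strip().splitlines()
        if not lines:
            continue
        subject, files = lines[0], lines[1:]
        if BUGFIX_PATTERN.search(subject):   # looks like a bug-fixing commit
            counts.update(f for f in files if f)

    return counts.most_common(top_n)

if __name__ == "__main__":
    for path, n_fixes in rank_hot_spots():
        print(f"{n_fixes:5d}  {path}")
```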
So, if that is true for software bugs, doesn't it stand to reason that the same holds for semantic impedance? That is, when a user selects one result and then, within some time window, selects a different one, the reason is that the first failed to meet their criteria for a match. Same intuition: users switch because the match, in their view, failed.
Rather than trying to “reason” about the semantics of terms, we can simply observe user behavior with regard to those terms in the aggregate. And perhaps even salt the mine, as it were, with deliberate test cases to probe theories about the semantics of terms.
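A minimal sketch of how such “switches” might be counted from a click log, as the analogue of ranking files by bug-fixing commits. The log schema, the five-minute window, and the `rank_failed_matches` helper are all assumptions made here for illustration, not anything proposed in the post.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed click-log schema: (user_id, query, result_id, timestamp).
# The five-minute window is an illustrative choice, to be tuned by experiment.
SWITCH_WINDOW = timedelta(minutes=5)

def rank_failed_matches(clicks, top_n=20):
    """Rank (query, result) pairs by how often users abandoned them.

    A "switch" is counted when the same user, for the same query, clicks a
    different result within SWITCH_WINDOW -- the analogue of a file being
    touched by a bug-fixing commit.
    """
    ordered = sorted(clicks, key=lambda c: (c[0], c[1], c[3]))
    switches = Counter()
    prev = None
    for user, query, result, ts in ordered:
        if prev is not None:
            p_user, p_query, p_result, p_ts = prev
            if ((user, query) == (p_user, p_query)
                    and result != p_result
                    and ts - p_ts <= SWITCH_WINDOW):
                switches[(p_query, p_result)] += 1   # the first pick failed
        prev = (user, query, result, ts)
    return switches.most_common(top_n)

if __name__ == "__main__":
    log = [
        ("u1", "jaguar", "jaguar-the-car",    datetime(2011, 12, 13, 9, 0)),
        ("u1", "jaguar", "jaguar-the-animal", datetime(2011, 12, 13, 9, 2)),
    ]
    # -> [(('jaguar', 'jaguar-the-car'), 1)]
    print(rank_failed_matches(log))
```

The deliberate “salting” cases would simply be rows added to the same log, so theories about particular terms can be tested against the same aggregate counts.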
I haven’t done the experiment yet, but it is certainly something I will be looking into next year. I think it has definite potential and would scale.