From the post:
From the original post:
One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has their advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish among language with a vocal minority versus those that are actually have large communities.
So, we spent an evening at Princeton hacking around on Github and StackOverflow to get data on the number of projects and questions tagged, per programming language, respectively. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O’Grady at Redmonk has been re-running the analysis to show changes in the relative position of languages over time.
Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisits the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.
I would not down play the importance of Drew’s descriptive analysis.
Until you can describe something, it is really difficult to explain it.