Corpus-based Empirical Software Engineering – Ekaterina Pek by Felienne Hermans.
Felienne was live blogging Ekaterina’s presentation and defense (Defense of Ekaterina Pek May 21, 2014) today.
From the presentation notes:
The motivation for Kate’s work, she tells us, is the work of Knuth, who empirically studied punchcards with FORTRAN code in order to discover ‘what programmers really do’, as opposed to ‘what programmers should do’.
Kate has the same goal: she wants to measure use of languages:
- frequency counts -> How often are parts of the language used?
- coverage -> What parts of the language are used?
- footprint -> How much of each language part is used?
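To make the three measures concrete, here is a minimal sketch in Python, not Kate’s actual tooling. It assumes a toy two-file corpus (`CORPUS`), a hand-picked set of language parts (`TRACKED`), and reads ‘footprint’ as the share of corpus files in which each part appears. All of those names and that reading of footprint are my assumptions, not something taken from the presentation.

```python
import ast
from collections import Counter

# Hypothetical toy corpus: file name -> Python source text.
CORPUS = {
    "a.py": "x = 1\nif x:\n    print(x)\n",
    "b.py": "for i in range(3):\n    print(i)\n",
}

# Language parts we choose to track (an illustrative selection).
TRACKED = ("Assign", "If", "For", "While", "Call")


def construct_counts(source):
    """Count AST node types occurring in one source file."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def frequency_counts(corpus):
    """Frequency counts: how often is each part used across the corpus?"""
    total = Counter()
    for source in corpus.values():
        total.update(construct_counts(source))
    return total


def coverage(corpus, tracked=TRACKED):
    """Coverage: which of the tracked parts are used at all?"""
    used = frequency_counts(corpus)
    return {name for name in tracked if used[name] > 0}


def footprint(corpus, tracked=TRACKED):
    """Footprint (assumed reading): fraction of files using each part."""
    per_file = [construct_counts(source) for source in corpus.values()]
    return {
        name: sum(1 for counts in per_file if counts[name] > 0) / len(per_file)
        for name in tracked
    }


if __name__ == "__main__":
    print(frequency_counts(CORPUS))
    print(coverage(CORPUS))
    print(footprint(CORPUS))
```

On a real corpus you would read the sources from disk and skip files that fail to parse, which is one small instance of the sanitizing a corpus needs.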
In order to be able to perform such analyses, we need a ‘corpus’, a big set of language data to work on. Knuth even collected punch cards from garbage bins, because it was so important for him to get more data.
And it is not just code she looked at: libraries, bugs, emails and commits are also taken into account. But some have to be sanitized in order to be usable for the corpus.
Now there is an interesting sea of subjects.
Imagine exploring such a corpus for patterns of bugs and merging in patterns found in bug reports.
After all, bugs are introduced when programmers program as they do in real life, not as they would in theory.