behind the scenes: cleaning dirty data

From the post:

Dirty Data. It’s everywhere! And that’s expected and ok and even frankly good imho — it happens when people are doing complicated things, in the real world, with lots of edge cases, and moving fast. Perfect is the enemy of good.

Alas, it’s definitely behind-the-scenes work to find and fix dirty data problems, which means none of us learn from each other in the process. So, here’s a quick post about a dirty data issue we recently dealt with. Hopefully it’ll help you feel comradery, and maybe help some people using the BASE data.

We traced some oaDOI bugs to dirty records from PMC in the BASE open access aggregation database.

BASE = Bielefeld Academic Search Engine.

oaDOI = a resolver that works like a DOI link but points to an open access version of the article.

PMC = PubMed Central.

Are you cleaning data or contributing more dirty data?

