21st-Century Data Miners Meet 19th-Century Electrical Cables by Cynthia Rudin, Rebecca J. Passonneau, Axinia Radeva, Steve Ierome, and Delfina F. Isaac, Computer, June 2011 (vol. 44 no. 6).
As they say, the past is never far behind. In this case, about 5% of the low-voltage cables in Manhattan were installed before 1930. The records that Consolidated Edison (ConEd) keeps on its cabling, and on the manholes that give access to it, vary in form and content and originate in different departments, starting in the 1880s. Yes, the 1880s, for those of you who think the 1980s are ancient history.
From the article:
The text in trouble tickets is very irregular and thus challenging to process in its raw form. There are many spellings of each word–for instance, the term “service box” has at least 38 variations, including SB, S, S/B, S.B, S?B, S/BX, SB/X, S/XB, /SBX, S.BX, S&BX, S?BX, S BX, S/B/X, S BOX, SVBX, SERV BX, SERV-BOX, SERV/BOX, and SERVICE BOX.
Similar difficulties plagued other tasks, such as determining the type of event from trouble tickets.
Read the article for the details on how the researchers succeeded in showing that legacy data can assist in the maintenance of a current electrical grid.
I suspect that “service box” is used by numerous utilities, with similar divergence in how it is recorded. A more general application written as a topic map would preserve all those variations and use them in searching other data records. It is the reuse of user analysis and data that makes them so valuable.
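To make the idea of preserving variations concrete, here is a minimal Python sketch. It is not the researchers' method and not a real topic map engine. It simply keeps the variants quoted above attached to one subject and uses them to search free text. The subject identifier `service-box` and the function names are hypothetical, chosen only for illustration.

```python
import re

# Variant spellings quoted from the article (20 of the 38 it reports).
SERVICE_BOX_VARIANTS = [
    "SB", "S", "S/B", "S.B", "S?B", "S/BX", "SB/X", "S/XB", "/SBX", "S.BX",
    "S&BX", "S?BX", "S BX", "S/B/X", "S BOX", "SVBX", "SERV BX", "SERV-BOX",
    "SERV/BOX", "SERVICE BOX",
]

# Hypothetical subject identifier mapped to every name recorded for it.
SUBJECTS = {"service-box": SERVICE_BOX_VARIANTS}


def build_index(subjects):
    """Map each upper-cased variant to its subject identifier."""
    index = {}
    for subject_id, variants in subjects.items():
        for variant in variants:
            index[variant.upper()] = subject_id
    return index


def compile_pattern(index):
    """Build one regex that matches any known variant as a standalone token."""
    # Longest variants first, so "SERVICE BOX" is tried before "S".
    alternatives = sorted(index, key=len, reverse=True)
    body = "|".join(re.escape(v) for v in alternatives)
    # The lookarounds reject matches glued to a letter or digit.
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + body + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def find_subjects(text, index, pattern):
    """Return the set of subject identifiers mentioned in the text."""
    return {index[m.group(0).upper()] for m in pattern.finditer(text)}


INDEX = build_index(SUBJECTS)
PATTERN = compile_pattern(INDEX)

# Made-up ticket text, for illustration only.
print(find_subjects("SMOKING SVBX AT CORNER", INDEX, PATTERN))
```

Because the variants are stored as data rather than buried in code, the same index could be pointed at another utility's records. Very short variants such as the single letter S are ambiguous on their own, so a real application would likely need surrounding context to decide whether they refer to a service box.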