New Open XML PowerTool Cmdlet simplifies retrieval of document metrics by Doug Mahugh.
From the post:
It’s been a good year for Open XML developers. The release of the Open XML SDK as an open source project back in June was well-received by the community, and enabled contributions such as the makefile to automate use of the SDK on Mono and a Visual Studio project for the SDK. Project leader Eric White has worked to refine and improve the testing process, and here at MS Open Tech we’ve been working with our China team to get the word out, starting with mirroring the repo to GitCafe for ease of access in China.
Today there’s another piece of good news for Open XML developers: Eric White has added a new Get-DocxMetrics Cmdlet to the Open XML PowerTools, the latest step in a developer-focused reengineering of the PowerTools to make them even more flexible and useful to Open XML developers. As Eric explains in his blog post on the Open XML Developer site:
My latest foray is a new Cmdlet, Get-DocxMetrics, which returns a lot of useful information about a WordprocessingML document. A summary of the information it returns for a document:
- The style hierarchy – styles can inherit from other styles, and it is helpful to know what styles are defined in a document.
- The content control hierarchy. We can examine the hierarchy, and design an XSD schema to validate them.
- The list of languages used in a document, such as en-US, fr-FR, and so on.
- Whether a document contains tracked revisions, text boxes, complex fields, simple fields, altChunk content, tables, hyperlinks, legacy frames, ActiveX controls, sub documents, references to null images, embedded spreadsheets, document protection, multi-font runs, the list of numbering formats used, and more.
- Metrics on how large the document is, including element counts, average paragraph lengths, run count, zero length text elements, ASCII character counts, complex script character counts, East Asia character counts, and the count of runs of each of the variety of characters.
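To give a feel for the kind of counting involved, here is a minimal Python sketch that pulls a few similar numbers straight out of a .docx package. It is not the Get-DocxMetrics implementation and does not use the PowerTools. The function name `docx_metrics` and the dictionary keys are my own inventions, and it reads only the main document part (`word/document.xml`), so anything stored in styles, headers, footers, or footnotes is ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_metrics(source):
    """Return a few rough metrics for a .docx file (path or file-like object).

    Hypothetical helper for illustration; only word/document.xml is inspected.
    """
    with zipfile.ZipFile(source) as pkg:
        root = ET.fromstring(pkg.read("word/document.xml"))

    def count(tag):
        # Count every element with the given local name in the w: namespace.
        return sum(1 for _ in root.iter(f"{{{W}}}{tag}"))

    # Language tags appear as <w:lang w:val="en-US"/> inside run properties.
    languages = sorted(
        {
            lang.get(f"{{{W}}}val")
            for lang in root.iter(f"{{{W}}}lang")
            if lang.get(f"{{{W}}}val")
        }
    )

    return {
        "paragraphs": count("p"),
        "runs": count("r"),
        "tables": count("tbl"),
        "hyperlinks": count("hyperlink"),
        # Tracked insertions and deletions are marked with w:ins and w:del.
        "has_tracked_revisions": count("ins") + count("del") > 0,
        "languages": languages,
    }
```

Calling `docx_metrics("report.docx")` (the file name is a placeholder) returns a plain dictionary, which makes it easy to dump to JSON or feed into a larger survey.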
Get-DocxMetrics sounds like a viable way to generate statistics on a collection of Open XML files and so determine which features of Open XML an enterprise or government actually uses. That would make creating specialized tools for such entities a far more certain proposition.
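The collection-wide tally could be as simple as the following Python sketch. It assumes each document has already been reduced to a dictionary of feature flags, by whatever tool produced them. The function name `feature_usage` and the feature names in the usage example are hypothetical, not the cmdlet's actual output fields.

```python
from collections import Counter


def feature_usage(reports):
    """Return the share of documents in which each feature is present.

    `reports` is an iterable of dicts mapping a feature name to True/False,
    one dict per document (an assumed shape, for illustration only).
    """
    reports = list(reports)
    counts = Counter()
    for report in reports:
        for name, present in report.items():
            # Adding 0 still registers the feature, so unused ones show up.
            counts[name] += 1 if present else 0
    total = len(reports)
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    sample = [
        {"tables": True, "altChunk": False, "activeX": False},
        {"tables": True, "altChunk": True, "activeX": False},
    ]
    for name, share in sorted(feature_usage(sample).items()):
        print(f"{name}: {share:.0%}")
```

Features that never appear are reported with a share of zero rather than dropped, since knowing what is not in use is just as relevant when deciding what a specialized tool has to support.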
Output from such an analysis would be a nice input to a topic map for mapping usage to other formats: what maps, what misses, and so on.
Looking forward to hearing more about this tool in the new year!