Archive for the ‘Conversion’ Category


Friday, January 31st, 2014

Transform DOCX to HTML/CSS with High-Fidelity using PowerTools for Open XML by Eric White.

From the post:

Today I am happy to announce the release of HtmlConverter version 2.06.00, which is a high fidelity conversion from DOCX to HTML/CSS. HtmlConverter is a module in the PowerTools for Open XML project.

HtmlConverter.cs 2.06.00 supports:

  • Paragraph styles, character styles, and table styles, including styles that are based on other styles.
  • Table styles include support for conditional table style options (header row, total row, banded rows, first column, last column, and banded columns).
  • Fonts, including font styles such as bold, italic, underline, strikethrough, foreground and background colors, shading, sub-script, super-script, and more.  HtmlConverter is, in effect, guidance on how to correctly determine the font and formatting for each paragraph and text run in a document.
  • Numbered and bulleted lists.  Current support is only for en-US and fr-FR; however, HtmlConverter is factored and parameterized so that you can support other languages without altering the source code.  In the near future, I’ll be publishing guidance and instructions on how to support additional languages, and I’ll be asking for volunteers to write and contribute the bits of code to generate cardinal (one, two, three) and ordinal (first, second, third) implementations for your native language, as well as the various Asian and RTL numbering systems.
  • Tabs, including left tabs, right tabs, centered tabs, and decimal tabs.  HtmlConverter takes the approach of using font metrics to calculate the exact width of the various pieces of text in a line, and inserts <span> elements with precisely calculated widths.
  • High fidelity support for vertical white space and horizontal white space, including indented text, hanging indents, centered text, right justified text, and justified text.
  • Borders around paragraphs, and high fidelity for borders of tables.
  • Horizontally and vertically merged cells in tables.
  • External hyperlinks, and internal hyperlinks to bookmarks within the document.
  • You have much more control over the conversion when compared to other approaches to converting to HTML.  There are already a number of parameters that enable you to control the transformation, and in the future I’ll be adding many more knobs and levers to fine tune the conversion.  And of course, you have the source code, so you can customize the conversion for your scenario.
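
The tab handling described in the list above (measure the text, then insert a <span> of calculated width) can be sketched roughly as follows. This is not code from HtmlConverter, which is written in C#. The function names, the fixed average character width, and the inline-block styling are all assumptions made for illustration; a real converter would query actual font metrics for each run.

```python
from html import escape

# Hypothetical stand-in for real font metrics: every character is
# assumed to be 6 points wide. A real converter would measure each
# run against the actual font.
AVG_CHAR_WIDTH_PT = 6.0


def text_width_pt(text: str) -> float:
    """Estimate the rendered width of text in points (assumed metric)."""
    return len(text) * AVG_CHAR_WIDTH_PT


def render_left_tab(before: str, after: str, tab_stop_pt: float) -> str:
    """Emit HTML for 'before<TAB>after' with a left tab at tab_stop_pt.

    The tab becomes an empty inline-block span whose width fills the
    gap between the end of the preceding text and the tab stop.
    """
    used = text_width_pt(before)
    # If the text already runs past the tab stop, use a zero-width gap.
    gap = max(tab_stop_pt - used, 0.0)
    return (
        f"<span>{escape(before)}</span>"
        f'<span style="display:inline-block;width:{gap:.2f}pt"></span>'
        f"<span>{escape(after)}</span>"
    )


if __name__ == "__main__":
    print(render_left_tab("Name", "Value", 72.0))
```

With the assumed 6-point character width, "Name" occupies 24 points, so a tab stop at 72 points yields a spacer span 48.00pt wide. Right, centered, and decimal tabs would need the width of the text after the tab as well, which this sketch does not attempt.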

Eric’s post also asks readers which features should get priority for addition to HtmlConverter.


PowerTools for Open XML is licensed under the Microsoft Public License (Ms-PL), which gives you wide latitude in how you use the code, including its use in commercial products and open source projects.

It won’t be long before software that is not open source is what draws comment.

I first saw this in a tweet by Open Microsoft.

Introducing Tabula

Thursday, April 4th, 2013

Introducing Tabula by Manuel Aristarán, Mike Tigas.

From the post:

Tabula lets you upload a (text-based) PDF file into a simple web interface and magically pull tabular data into CSV format.

It is hard to say why governments and others imprison tabular data in PDF files.

I suspect they see some advantage in preventing comparison to other data or even checking the consistency of data in a single report.

Whatever their motivations, let’s disappoint them!

Details on how to help are in the blog post.

pdfx v1.0 [PDF-to-XML]

Thursday, December 27th, 2012

pdfx v1.0

From the homepage:

Fully-automated PDF-to-XML conversion of scientific text

I submitted Static and Dynamic Semantics of NoSQL Languages, a paper I blogged about earlier this week: twenty-four pages with many citations and equations.

I forgot to set a timer, but this isn’t for the impatient: I think the conversion ran for more than ten minutes.

Some mathematical notation defeats the conversion process.

See: Static-and-Dynamic-Semantics-NoSQL-Languages.tar.gz for the original PDF plus the HTML and PDF outputs.

For occasional conversions where heavy math notation isn’t required, this may prove to be quite useful.