Archive for the ‘HTML’ Category

How Bad Is Wikileaks Vault7 (CIA) HTML?

Thursday, March 9th, 2017

How bad?

Unless you want to hand-correct 7809 HTML files to use with XQuery, grab the latest copy of Tidy.

It’s not the worst HTML I have ever seen, but put that in the context of having seen a lot of really poor HTML.

I’ve “tidied” up a test collection and will grab a fresh copy of the files before producing and releasing a clean set of the HTML files.

Producing a document collection for XQuery processing. Working towards something suitable for application of NLP and other tools.
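
If you want a rough count of how many files will choke an XML toolchain before you reach for Tidy, try parsing each one as XML and collect the failures. A minimal sketch in Python (standard library only, not Tidy itself); the directory you pass in is your own, and nothing here is specific to the Vault7 files:

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(directory):
    """Return (path, error message) pairs for .html files that are not well-formed XML."""
    failures = []
    for path in sorted(Path(directory).glob("*.html")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            # Unclosed tags, bare ampersands, undefined entities such as &nbsp;, etc.
            failures.append((path, str(err)))
    return failures
```

Calling `find_malformed("some_directory")` returns one entry per file that needs cleanup; an empty list means the collection already parses as XML.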

TEI XML -> HTML w/ XQuery [+ CSS -> XML]

Thursday, May 5th, 2016

Convert TEI XML to HTML with XQuery and BaseX by Adam Steffanick.

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements but the methods presented could be extended to a larger set of TEI elements.

TEI 5 has 563 elements, which may appear in varying, valid, combinations. It also defines 256 attributes which are distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to ensure that encoded texts conform to your project’s definition of expected text encoding.
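
The same QA idea can be sketched outside XQuery. The Python below (standard library only) reports TEI elements that fall outside a project's expected set. The `EXPECTED` set is a hypothetical project profile made up for illustration, not anything defined by TEI or by Adam's post:

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical project profile: the only TEI elements this project expects to see.
EXPECTED = {"TEI", "teiHeader", "fileDesc", "titleStmt", "title",
            "text", "body", "div", "head", "p"}


def unexpected_elements(xml_text, expected=EXPECTED):
    """Count TEI-namespace elements whose local name is outside the expected set."""
    counts = Counter()
    prefix = "{" + TEI_NS + "}"
    for elem in ET.fromstring(xml_text).iter():
        if isinstance(elem.tag, str) and elem.tag.startswith(prefix):
            local = elem.tag[len(prefix):]
            if local not in expected:
                counts[local] += 1
    return counts
```

An empty result means the file stays within the profile. Anything else is a list of elements (with counts) for an editor to review. This checks element names only, not attributes or content models.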

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.

What is Scholarly HTML?

Saturday, October 31st, 2015

What is Scholarly HTML? by Robin Berjon and Sébastien Ballesteros.


Scholarly HTML is a domain-specific data format built entirely on open standards that enables the interoperable exchange of scholarly articles in a manner that is compatible with off-the-shelf browsers. This document describes how Scholarly HTML works and how it is encoded as a document. It is, itself, written in Scholarly HTML.

The abstract is accurate enough but the “Motivation” section provides a better sense of this project:

Scholarly articles are still primarily encoded as unstructured graphics formats in which most of the information initially created by research, or even just in the text, is lost. This was an acceptable, if deplorable, condition when viable alternatives did not seem possible, but document technology has today reached a level of maturity and universality that makes this situation no longer tenable. Information cannot be disseminated if it is destroyed before even having left its creator’s laptop.

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

This is not solely a loss for the high principles of knowledge sharing in science, it also has very immediate pragmatic consequences. Any tool, any service that tries to integrate with scholarly publishing has to spend the brunt of its complexity (or budget) extracting data the author would have willingly shared out of antiquated formats. This places stringent limits on the improvement of the scholarly toolbox, on the discoverability of scientific knowledge, and particularly on processes of meta-analysis.

To address these issues, we have followed an approach rooted in established best practices for the reuse of open, standard formats. The «HTML Vernacular» body of practice provides guidelines for the creation of domain-specific data formats that make use of HTML’s inherent extensibility (Science.AI, 2015b). Using the vernacular foundation overlaid with «» metadata we have produced a format for the interchange of scholarly articles built on open standards, ready for all to use.

Our high-level goals were:

  • Uncompromisingly enabling structured metadata, accessibility, and internationalisation.
  • Pragmatically working in Web browsers, even if it occasionally incurs some markup overhead.
  • Powerfully customisable for inclusion in arbitrary Web sites, while remaining easy to process and interoperable.
  • Entirely built on top of open, royalty-free standards.
  • Long-term viability as a data format.

Additionally, in view of the specific problem we addressed, in the creation of this vernacular we have favoured the reliability of interchange over ease of authoring; but have nevertheless attempted to cater to the latter as much as possible. A decent boilerplate template file can certainly make authoring relatively simple, but not as radically simple as it can be. For such use cases, Scholarly HTML provides a great output target and overview of the data model required to support scholarly publishing at the document level.

An example of an authoring format that was designed to target Scholarly HTML as an output is the DOCX Standard Scientific Style which enables authors who are comfortable with Microsoft Word to author documents that have a direct upgrade path to semantic, standard content.

Where semantic modelling is concerned, our approach is to stick as much as possible to Beyond the obvious advantages there are in reusing a vocabulary that is supported by all the major search engines and is actively being developed towards enabling a shared understanding of many useful concepts, it also provides a protection against «ontological drift» whereby a new vocabulary is defined by a small group with insufficient input from a broader community of practice. A language that solely a single participant understands is of limited value.

In a small, circumscribed number of cases we have had to depart from, using the (prefixed with sa:) vocabulary instead (Science.AI, 2015a). Our goal is to work with in order to extend their vocabulary, and we will align our usage with the outcome of these discussions.

I especially enjoyed the observation:

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

I don’t doubt the truth of that story but after all, a large number of people are interested in baking cupcakes. In many cases, no more than three people are interested in reading any particular academic paper.

The use of will provide advantages for common concepts but to be truly useful for scholarly writing, it will require serious extension.

Take for example my post yesterday Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]. What microdata from would help readers find Propositionalisation and Aggregates, 2001, which describes substantially the same technique, without claims of surpassing human intuition? (Uncited by the authors of the paper on deep feature synthesis.)

Or the 161 papers on propositionalisation that you can find at CiteSeer?

A crude classification that can be used by search engines is very useful but falls far short of the mark in terms of finding and retrieving scholarly writing.

Semantic uniformity for classifying scholarly content hasn’t been reached by scholars or librarians despite centuries of effort. Rather than taking up that Sisyphean task, let’s map across the ever increasing universe of semantic diversity.

W3C Validation Tools – New Location

Wednesday, June 3rd, 2015

W3C Validation Tools

The W3C graciously hosts the following free validation tools:

CSS Validator – Checks your Cascading Style Sheets (CSS)

Internationalization Checker – Checks level of internationalization-friendliness.

Link Checker – Checks your web pages for broken links.

Markup Validator – Checks the markup of your Web documents (HTML or XHTML).

RSS Feed Validator – Validator for syndicated feeds. (RSS and Atom feeds)

RDF Validator – Checks and visualizes RDF documents.

Unicorn – Unified validator. HTML, CSS, Links & Mobile. Checks HTML5.

I mention that these tools are free to emphasize there is no barrier to their use.

Just as you wouldn’t submit a research paper with pizza grease stains on it, use these tools to proof draft standards before you submit them for review.

HTML5 is a W3C Recommendation

Tuesday, October 28th, 2014

HTML5 is a W3C Recommendation

From the post:

(graphic omitted) The HTML Working Group today published HTML5 as W3C Recommendation. This specification defines the fifth major revision of the Hypertext Markup Language (HTML), the format used to build Web pages and applications, and the cornerstone of the Open Web Platform.

“Today we think nothing of watching video and audio natively in the browser, and nothing of running a browser on a phone,” said Tim Berners-Lee, W3C Director. “We expect to be able to share photos, shop, read the news, and look up information anywhere, on any device. Though they remain invisible to most users, HTML5 and the Open Web Platform are driving these growing user expectations.”

HTML5 brings to the Web video and audio tracks without needing plugins; programmatic access to a resolution-dependent bitmap canvas, which is useful for rendering graphs, game graphics, or other visual images on the fly; native support for scalable vector graphics (SVG) and math (MathML); annotations important for East Asian typography (Ruby); features to enable accessibility of rich applications; and much more.

The HTML5 test suite, which includes over 100,000 tests and continues to grow, is strengthening browser interoperability. Learn more about the Test the Web Forward community effort.

With today’s publication of the Recommendation, software implementers benefit from Royalty-Free licensing commitments from over sixty companies under W3C’s Patent Policy. Enabling implementers to use Web technology without payment of royalties is critical to making the Web a platform for innovation.

Read the Press Release, testimonials from W3C Members, and acknowledgments. For news on what’s next after HTML5, see W3C CEO Jeff Jaffe’s blog post: Application Foundations for the Open Web Platform. We also invite you to check out our video Web standards for the future.

Just in case you have been holding off on HTML5 until it became a W3C Recommendation. 😉


The Case for HTML Word Processors

Wednesday, October 1st, 2014

The Case for HTML Word Processors by Adam Hyde.

From the post:

Making a case for HTML editors as stealth Desktop Word Processors…the strategy has been so stealthy that not even the developers realised what they were building.

We use all these over-complicated softwares to create Desktop documents. Microsoft Word, LibreOffice, whatever you like – we know them. They are one of the core apps in any users operating system. We also know that they are slow, unwieldy and have lots of quirky ways of doing things. However most of us just accept that this is the way it is and we try not to bother ourselves by noticing just how awful these softwares actually are.

So, I think it might be interesting to ask just this simple question – what if we used Desktop HTML Editors instead of Word Processors to do Word Processing? It might sound like an irrational proposition…Word Processors are, after all, for Word Processing. HTML editors are for creating…well, …HTML. But lets just forget that. What if we could allow ourselves to imagine we used an HTML editor for all our word processing needs and HTML replaces .docx and .odt and all those other over-burdened word processing formats. What do we win and what do we lose?

I’m not convinced about HTML word processors but Adam certainly starts with the right question:

What do we win and what do we lose? (emphasis added)

Line your favorite word processing format up alongside HTML + CSS and calculate the wins and losses.

Not that HTML word processors can, should or will replace complex typography when appropriate, but how many documents need the full firepower of a modern word processor?

I would ask a similar question about authoring interfaces for topic maps. What is the least interface that can usefully produce a topic map?

The full bells-and-whistles versions are common now (I omit naming names) but should those be the only choices?

PS: As for MS Word, I use “open,” “close,” “save,” “copy,” “paste,” “delete,” “hyperlink,” “bold,” and “italic.” What’s that? Nine operations? Your experience may vary. 😉

I use LaTeX and another word processing application for most of my writing off the Web.

I first saw this in a tweet by Ivan Herman.

Shadow DOM

Tuesday, March 25th, 2014

Shadow DOM by Steven Wittens.

From the post:

For a while now I’ve been working on MathBox 2. I want to have an environment where you take a bunch of mathematical legos, bind them to data models, draw them, and modify them interactively at scale. Preferably in a web browser.

Unfortunately HTML is crufty, CSS is annoying and the DOM’s unwieldy. Hence we now have libraries like React. It creates its own virtual DOM just to be able to manipulate the real one—the Agile Bureaucracy design pattern.

The more we can avoid the DOM, the better. But why? And can we fix it?

One of the better posts on markup that I have read in a very long time.

Also of interest, Steven’s heavy interest in graphics and visualization.

His MathBox project for example.

Identifiers, 404s and Document Security

Wednesday, December 12th, 2012

I am working on a draft about identifiers (using the standard <a> element) when it occurred to me that URLs could play an unexpected role in document security. (At least unexpected by me, your mileage may vary.)

What if I create a document that has URLs like:

<a href="http://server-exists.x/page-does-not.html">text content</a>

So that a user who attempts to follow the link gets a “404” message back.

Why is that important?

What if I am writing HTML pages at a nuclear weapon factory? I would be very interested in knowing if one of my pages had gotten off the reservation so to speak.

The server being accessed for a page that deliberately does not exist could route the contact information for an appropriate response.
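
To make the server side of that idea concrete, here is a minimal sketch using Python's standard `http.server`. The path in `TRIPWIRE_PATHS` and the `tripwire` logger name are hypothetical, and "routing the contact information" is reduced to a log line. A real deployment would forward the hit to whoever is responsible for responding:

```python
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical tripwire path: a page that deliberately does not exist.
TRIPWIRE_PATHS = {"/page-does-not.html"}

log = logging.getLogger("tripwire")


def is_tripwire(path):
    """True if the requested path (ignoring any query string) is a planted one."""
    return path.split("?", 1)[0] in TRIPWIRE_PATHS


class TripwireHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_tripwire(self.path):
            # Record who asked and where the link was followed from, if the client says.
            log.warning("tripwire hit: %s from %s (Referer: %s)",
                        self.path, self.client_address[0],
                        self.headers.get("Referer", "-"))
        # Every request gets a plain 404, so the visitor sees nothing unusual.
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(port=0):
    """Bind to localhost; port 0 lets the operating system pick a free port."""
    return HTTPServer(("127.0.0.1", port), TripwireHandler)
```

Running `make_server(8000).serve_forever()` would answer every GET with a 404 and log only the requests for planted pages.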

Of course, I would use better names or have pages that load, while transmitting the same contact information.

Or have a very large uuencoded “password” file that burps, bumps and slowly downloads. (Always knew there was a reason to keep a 2400 baud modem around.)

Have suggestions on how to make a non-existent URL work but will save that for another day.

[Pump Up Web Technology Search Clutter]

Tuesday, October 9th, 2012

From the webpage:

We are an open community of developers building resources for a better web, regardless of brand, browser or platform. Anyone can contribute and each person who does makes us stronger. Together we can continue to drive innovation on the Web to serve the greater good. It starts here, with you.

From Matt Brian:

In an attempt to create the “definitive resource” for all open Web technologies, Apple, Adobe, Facebook, Google, HP, Microsoft, Mozilla, Nokia, and Opera have joined the W3C to launch a new website called ‘Web Platform

The new website will serve as a single source of relevant, up-to-date and quality information on the latest HTML5, CSS3, and other Web standards, offering tips on web development and best practises for the technologies.

I first saw this at the (Angela Guess).

So, maybe having documentation, navigable and good documentation, isn’t so weird after all. 😉

Assume I search for guidance on HTML5, CSS3, etc. Now there is a new site to add to web technology search results.

Glad to see the site, but not the addition to search clutter.

I suppose you could boost the site in response to all searches for web technology. Wonder if that will happen?

Doesn’t help your local silo of links.

At or Near Final Calls on W3C Provenance

Wednesday, October 3rd, 2012

I saw a notice today about the ontology part of the W3C work on provenance. Some of it is at final call or nearly so. If you are interested, see:

  • PROV-DM, the PROV data model for provenance;
  • PROV-CONSTRAINTS, a set of constraints applying to the PROV data model;
  • PROV-N, a notation for provenance aimed at human consumption;
  • PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of PROV to RDF;
  • PROV-AQ, the mechanisms for accessing and querying provenance;
  • PROV-PRIMER, a primer for the PROV data model.

My first impression is the provenance work is more complex than HTML 3.2 and therefore unlikely to see widespread adoption. (You may want to bookmark that link. It isn’t listed on the HTML page at the W3C, even under obsolete versions.)

HTML [Lessons in Semantic Interoperability – Part 3]

Sunday, September 2nd, 2012

If HTML is an example of semantic interoperability, are there parts of HTML that can be re-used for more semantic interoperability?

Some three (3) year old numbers on usage of HTML elements:

Element Percentage
a 21.00
td 15.63
br 9.08
div 8.23
tr 8.07
img 7.12
option 4.90
li 4.48
span 3.98
table 3.15
font 2.80
b 2.32
p 1.98
input 1.79
script 1.77
strong 0.97
meta 0.95
link 0.66
ul 0.65
hr 0.37

Assuming they still hold true, the <a> element is by far the most popular.
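
If you would rather check numbers like these against pages you care about, a tally is easy to sketch with Python's standard `html.parser`. This is only an illustration of the counting, not how the figures above were produced:

```python
from collections import Counter
from html.parser import HTMLParser


class ElementCounter(HTMLParser):
    """Tally every start tag seen (self-closing tags included)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1  # html.parser lowercases tag names


def element_percentages(html_text):
    """Map each element name to its share of all elements, most frequent first."""
    parser = ElementCounter()
    parser.feed(html_text)
    parser.close()
    total = sum(parser.counts.values())
    return {tag: round(100 * n / total, 2)
            for tag, n in parser.counts.most_common()}
```

Feed it the text of a page (or loop over a crawl and merge the counters) and compare the shares with the table.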

Implications for a semantic interoperability solution that leverages the <a> element?

Leave the syntax the hell alone!

As we saw in parts 1 and 2 of this series, the <a> element has:

  • simplicity
  • immediate feedback

If you don’t believe me, teach someone who doesn’t know HTML at all how to create an <a> element and verify its presence in a browser. (I’ll wait.)

Back so soon? 😉

To summarize: The <a> element is simple, has immediate feedback and is in widespread use.

All of which makes it a likely candidate to leverage for semantic interoperability. But how?

And what of all the other identifiers in the world? What happens to them?

HTML [Lessons in Semantic Interoperability – Part 2]

Saturday, September 1st, 2012

While writing Elli (Erlang Web Server) [Lessons in Semantic Interoperability – Part 1], I got distracted by the realization that web servers produce semantically interoperable content every day. Lots of it. For hundreds of millions of users.

My question: What makes the semantics of HTML different?

The first characteristic that came to mind was simplicity. Unlike some markup languages, ;-), HTML did not have to await the creation of WYSIWYG editors to catch on. In part I suspect because after a few minutes with it, most users (not all), could begin to author HTML documents.

Think about the last time you learned something new. What is the one thing that brings closure to the learning experience?

Feedback, knowing if your attempt at an answer is right or wrong. If right, you will attempt the same solution under similar circumstances in the future. If wrong, you will try again (hopefully).

When HTML appeared, so did primitive (in today’s terms) web browsers.

Any user learning HTML could get immediate feedback on their HTML authoring efforts.

Not feedback that arrives:

  • After installing additional validation software
  • After debugging complex syntax or configurations
  • After millions of other users do the same thing
  • After new software appears to take advantage of it

Immediate feedback means just that: immediate feedback.

The second characteristic is immediate feedback.

You can argue that such feedback was an environmental factor and not a characteristic of HTML proper.

Possibly, possibly but if such a distinction is possible and meaningful, how does it help with the design/implementation of the next successful semantic interoperability language?

I would argue that, by whatever means, any successful semantic interoperability language is going to include immediate feedback, however you classify it.

Creating Your First HTML 5 Web Page [HTML5 – Feature Freeze?]

Saturday, August 18th, 2012

Creating Your First HTML 5 Web Page by Michael Dorf.

From the post:

Whether you have been writing web pages for a while or you are new to writing HTML, the new HTML 5 elements are still within your reach. It is important to learn how HTML 5 works since there are many new features that will make your pages better and more functional. Once you get your first web page under your belt you will find that they are very easy to put together and you will be on your way to making many more.

To begin, take a look at this base HTML page we will be working with. This is just a plain-ol’ HTML page, but we can start adding HTML5 elements to jazz it up!

But that’s not why I am posting it here. 😉

A little later Michael says:

The new, simple DOCTYPE is much easier to remember and use than previous versions. The W3C is trying to stop versioning HTML so that backwards compatibility will become easier, so there are “technically” no more versions of HTML.

I’m not sure I follow on “…to stop versioning HTML so that backwards compatibility will become easier….”

Unless that means that HTML (5 I assume) is going into a feature/semantic freeze?

That would promote backwards compatibility but I am not sure it is a good solution.

Just curious if you have heard the same?


23 Useful Online HTML5 Tools

Friday, December 30th, 2011

23 Useful Online HTML5 Tools

Just in case you are working on delivery of topic maps using HTML5.

I am curious about the “Are you aware that HTML5 is captivating the web by leaps and bounds?” lead off line.

Particularly when I read articles like: HTML5: Current progress and adoption rates.

Or the following quote from: HTML5 Adoption Might Hurt Apple’s Profit, Research Finds

The switch from native apps to HTML5 apps will not happen overnight. At the moment, HTML5 apps have some problems that native apps do not. HTML5 apps are typically slower than native apps, which is a particularly important issue for games. An estimated 20 percent of mobile games will most likely never be Web apps, Bernstein said.

Furthermore, there are currently differences in Web browsers across mobile platforms that can raise development costs for HTML5 apps. They can also pose a greater security risk. This can result in restricting access to underlying hardware by handset manufacturers to reduce the possible impact of these risks.

Taking all this into account, Bernstein Research reckoned that HTML5 will mature in the next few years, which will in turn have an impact on Apple’s revenue growth. Nevertheless, the research firm, which itself makes a market in Apple, still recommended investing in the company.

Apple executives are reported to be supporters of HTML5. Which makes sense if you think about it. By the time HTML5 matures enough to be a threat, Apple will have moved on, leaving the HTML5ers to fight over what is left in a diminishing market share. Supporting a technology that makes your competition’s apps slower and less secure makes sense as well.

How are you using HTML5 with topic maps?

These Aren’t the Sites You’re Looking For: Building Modern Web Apps

Sunday, November 20th, 2011

These Aren’t the Sites You’re Looking For: Building Modern Web Apps

Interesting promo for HTML5, which is a developing way to deliver interaction with a topic map.

The presentation does not focus on use of user feedback, the absence of which can leave you with a “really cool” interface that no one outside the development team really likes. To no small degree, it is good interface design with users that tells the tale, not how the interface is seen to work on the “other” side of the screen.

BTW, the slides go out of their way to promote the Chrome browser. Browser usage statistics, you do the math. Marketing is a matter of numbers, not religion.

If you are experimenting with HTML5 as a means to interact with a topic map engine, would appreciate a note when you are ready to go public.

HTML5 web dev reading list

Thursday, October 27th, 2011

HTML5 web dev reading list

I am sure there are more of these than can be easily counted.

Suggestions on others that will be particularly useful for people developing topic map interfaces? (Should not be different from effective interfaces in general.)


The Simple Way to Scrape an HTML Table: Google Docs

Sunday, October 23rd, 2011

The Simple Way to Scrape an HTML Table: Google Docs

From the post:

Raw data is the best data, but a lot of public data can still only be found in tables rather than as directly machine-readable files. One example is the FDIC’s List of Failed Banks. Here is a simple trick to scrape such data from a website: Use Google Docs.

OK, not a great trick but if you are in a hurry it may be a useful one.
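
If you are not in a hurry, or don't want to route the data through Google Docs, the same scrape can be sketched with Python's standard `html.parser`. This is a different technique from the one in the post, and it assumes a simple table: explicit closing tags and no nested tables.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html_text):
    """Return the table rows found in html_text as lists of cell strings."""
    scraper = TableScraper()
    scraper.feed(html_text)
    scraper.close()
    return scraper.rows
```

The rows can go straight to `csv.writer(...).writerows(rows)` if a CSV file is what you are after.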

Of course, I get the excuse from local governments that their staff can’t export data in useful formats (I get images of budget documents in PDF files, how useful is that?).

Open – Videos

Thursday, October 13th, 2011

Open – Videos

For those of you who don’t think HTML5 and developers are all that weird:

Full-length videos from the first two TimesOpen events, HTML5 and Beyond, and Innovating Developer Culture, are now available. Approximately five (5!) hours in total, there’s a lot of good information.

We have the lineup in place for the next TimesOpen on Personalization & Privacy, taking place Tuesday October 25, 6:30 p.m., at the Times Building. Details and registration information will be posted soon (like next week).

HTML Data Task Force

Sunday, October 2nd, 2011

HTML Data Task Force, chaired by Jeni Tennison.

Another opportunity to participate in important work at the W3C without a membership. The “details” of getting diverse formats to work together.

Close analysis may show the need for changes to syntaxes, etc., but as far as mapping goes, topic maps can take syntaxes as they are. Could be an opportunity to demonstrate working solutions for actual use cases.

From the wikipage:

This HTML Data Task Force considers RDFa 1.1 and microdata as separate syntaxes, and conducts a technical analysis on the relationship between the two formats. The analysis discusses specific use cases and provide guidance on what format is best suited for what use cases. It further addresses the question how different formats can be used within the same document when required and how data expressed in the different formats can be combined by consumers.

The task force MAY propose modifications in the form of bug reports and change proposals on the microdata and/or RDFa specifications, to help users to easily transition between the two syntaxes or use them together. As with all such comments, the ultimate decisions on implementing these will rest with the respective Working Groups.

Further, the Task Force should also produce a draft specifications of mapping algorithms from an HTML+microdata content to RDF, as well as a mapping of RDFa to microdata’s JSON format. These MAY serve as input documents to possible future recommendation track works. These mappings should be, if possible, generic, i.e., they should not be dependent on any particular vocabulary. A goal for these mappings should be to facilitate the use of both formats with the same vocabularies without creating incompatibilities.

The Task Force will also consider design patterns for vocabularies, and provide guidance on how vocabularies should be shaped to be usable with both microdata and RDFa and potentially with microformats. These patterns MAY lead to change proposals of existing (RDF) vocabularies, and MAY result in general guidelines for the design of vocabularies for structured data on the web, building on existing community work in this area.

The Task Force liaises with the SWIG Web Schemas Task Force to ensure that lessons from real-world experience are incorporated into the Task Force recommendations and that any best practices described by the Task Force are synchronised with real-world practice.

The Task Force conducts its work through the mailing list (use this link to subscribe or look at the public archives), as well as on the #html-data-tf channel of the (public) W3C IRC server.