Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 9, 2017

How Bad Is Wikileaks Vault7 (CIA) HTML?

Filed under: HTML,Wikileaks,WWW,XQuery — Patrick Durusau @ 8:29 pm

How bad?

Unless you want to hand-correct 7809 HTML files to use with XQuery, grab the latest copy of Tidy.

It’s not the worst HTML I have ever seen, but put that in the context of having seen a lot of really poor HTML.

I’ve “tidied” up a test collection and will grab a fresh copy of the files before producing and releasing a clean set of the HTML files.
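A batch run of that kind could be scripted roughly as below. This is only a sketch, not the commands actually used for the test collection: it assumes the `tidy` executable is on the PATH, that the files end in `.html`, and that UTF-8 is the right encoding. The function names and directory names are hypothetical. The flags are standard Tidy options, but the right mix for these files is a guess.

```python
import subprocess
from pathlib import Path


def tidy_command(src: Path, dst: Path) -> list[str]:
    """Build a Tidy invocation that writes an XHTML copy of one file."""
    return [
        "tidy",
        "-quiet",                 # suppress non-essential output
        "-asxhtml",               # emit XHTML, which XML tools can parse
        "-numeric",               # numeric rather than named entities
        "-utf8",                  # assumption: UTF-8 in and out
        "--force-output", "yes",  # write output even when errors are reported
        "-output", str(dst),
        str(src),
    ]


def tidy_tree(src_dir: Path, out_dir: Path) -> dict[str, int]:
    """Run Tidy over every .html file under src_dir, mirroring paths in out_dir.

    Returns Tidy's exit status per file: 0 clean, 1 warnings, 2 errors.
    """
    results = {}
    for src in sorted(src_dir.rglob("*.html")):
        dst = out_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        proc = subprocess.run(tidy_command(src, dst), capture_output=True)
        results[str(src)] = proc.returncode
    return results


if __name__ == "__main__":
    # Hypothetical directory names; adjust to wherever the files live.
    status = tidy_tree(Path("vault7-raw"), Path("vault7-tidy"))
    errors = sum(1 for code in status.values() if code == 2)
    print(f"{len(status)} files processed, {errors} with errors")
```

Writing to a separate output directory leaves the originals untouched, so the run can be repeated on a fresh copy of the files.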

The immediate goal is a document collection fit for XQuery processing. Beyond that, I am working towards something suitable for NLP and other tools.
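Since XQuery processors expect well-formed XML, one rough way to gauge "how bad" is to count the files an XML parser rejects, before and after cleanup. The sketch below uses Python's standard-library parser. The function names and the `.html` extension are assumptions made for illustration, not part of any released tooling.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_well_formed(markup) -> bool:
    """Return True if markup (str or bytes) parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def failing_files(root: Path) -> list[Path]:
    """List the .html files under root that an XML parser rejects."""
    return [
        path
        for path in sorted(root.rglob("*.html"))
        if not is_well_formed(path.read_bytes())  # bytes: let the parser pick the encoding
    ]
```

Note that this parser rejects named entities such as `&nbsp;` unless they are declared, which is one reason to have the cleanup step emit numeric entities instead.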
