How bad?
Unless you want to hand-correct 7809 HTML files for use with XQuery, grab the latest copy of Tidy.
It’s not the worst HTML I have ever seen, but bear in mind that I have seen a lot of really poor HTML.
I’ve “tidied” up a test collection and will grab a fresh copy of the files before producing and releasing a clean set of the HTML files.
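For anyone who wants to try the same thing, here is a minimal sketch of a batch run, not the exact procedure used for the test collection. It assumes a `tidy` executable (HTML Tidy) is on your PATH; the directory names `raw_html` and `tidied` and the helper names `tidy_command` and `tidy_tree` are made up for illustration, and the option set is just one reasonable choice for getting well-formed XHTML that an XQuery processor can parse.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def tidy_command(src: Path, dest: Path) -> List[str]:
    """Build the Tidy command line for one file (hypothetical helper)."""
    return [
        "tidy",                   # assumption: HTML Tidy is installed and on PATH
        "-quiet",                 # suppress nonessential output
        "-asxhtml",               # write well-formed XHTML
        "-numeric",               # numeric rather than named entities
        "-utf8",                  # read and write UTF-8
        "--force-output", "yes",  # write output even when Tidy reports errors
        "-output", str(dest),
        str(src),
    ]


def tidy_tree(src_dir: Path, out_dir: Path) -> Dict[Path, int]:
    """Run Tidy over every .html file under src_dir, mirroring it into out_dir.

    Returns a mapping of source path to Tidy's exit status.
    """
    results: Dict[Path, int] = {}
    for src in sorted(src_dir.rglob("*.html")):
        dest = out_dir / src.relative_to(src_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        proc = subprocess.run(tidy_command(src, dest), capture_output=True, text=True)
        results[src] = proc.returncode
    return results


if __name__ == "__main__":
    # Hypothetical directory names; adjust to wherever the files live.
    statuses = tidy_tree(Path("raw_html"), Path("tidied"))
    failed = [path for path, code in statuses.items() if code >= 2]
    print(f"{len(statuses)} files processed, {len(failed)} with errors")
```

Tidy exits with 0 when a file is clean, 1 when it issued warnings, and 2 when it hit errors, so the files reported at the end are the ones worth inspecting by hand.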
The goal is to produce a document collection for XQuery processing, working towards something suitable for NLP and other tools.