Before you read today’s installment on Wikileaks’ Vault 7: CIA Hacking Tools Revealed, you should check out the latest drop from Wikileaks: CIA Vault 7 Dark Matter. Five documents and they all look interesting.
I started with a fresh copy of the HTML files in a completely new directory, ran Tidy first, and fixed:
page_26345506.html:<declarations><string name="½ö"></string></declarations><p>›<br>
which I described in: Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1).
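If you are playing along at home, the Tidy pass can be as simple as this (a sketch; flags vary by Tidy version, -asxhtml emits well-formed XHTML that Saxon can parse, -m edits each file in place):

for f in *.html; do
  tidy -q -m -asxhtml "$f"
done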
So with a clean and well-formed set of files, I modified the XQuery to collect all of the references to prior versions, reasoning that any file appearing as a prior version could be ditched, leaving only the latest files.
(: links whose text is all digits point to prior versions of a page :)
for $doc in collection('collection.xml')//a[matches(.,'^\d+$')]
return ($doc/@href/string(), '
')
Unlike the count function we used before, this returns the value of the href attribute and appends a newline after each one.
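To produce that listing from the command line, save the query (priors.xq is my name for it) and run it through Saxon; the !method=text parameter asks for plain-text output:

java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query \
  -q:priors.xq '!method=text' -o:priors.txt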
With that listing saved in priors.txt, I then ran (your experience may vary on this next step):
xargs rm < priors.txt
WARNING: If your file names have spaces in them, you may delete files unintentionally. My data had no such spaces, so this worked in this case; a safer variant follows.
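If your file names do have spaces (or other odd characters) in them, a NUL-delimited pipeline is safer (a sketch):

# NUL-terminate the names so whitespace inside them survives xargs
tr '\n' '\0' < priors.txt | xargs -0 rm --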
Once I had the set of files without those representing “previous versions,” I was down to the expected 1134.
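A quick sanity check on that count:

ls -1 *.html | wc -l    # should report 1134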
That’s still a fair number of files and there is a lot of cruft in them.
For variety I took a look at XSLT; these empty template statements strip the cruft from the files:
<xsl:template match="style"/>
<xsl:template match="//div[@id = 'submit_wlkey']"/>
<xsl:template match="//div[@id = 'submit_help_contact']"/>
<xsl:template match="//div[@id = 'submit_help_tor']"/>
<xsl:template match="//div[@id = 'submit_help_tips']"/>
<xsl:template match="//div[@id = 'submit_help_after']"/>
<xsl:template match="//div[@id = 'submit']"/>
<xsl:template match="//div[@id = 'submit_help_buttons']"/>
<xsl:template match="//div[@id = 'efm-button']"/>
<xsl:template match="//div[@id = 'top-navigation']"/>
<xsl:template match="//div[@id = 'menu']"/>
<xsl:template match="//footer"/>
<xsl:template match="//script"/>
<xsl:template match="//li[@class = 'comment']"/>
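On their own these empty templates only suppress what they match; to copy everything else through unchanged, they need to sit alongside an identity template in a complete stylesheet. A minimal sketch (clean.xsl is my name for it; paste the empty templates above into the marked spot):

cat > clean.xsl <<'EOF'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity template: copies every node not matched elsewhere -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- the empty templates above go here -->
</xsl:stylesheet>
EOF

for file in *.html; do
  java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Transform \
    -s:"$file" -xsl:clean.xsl -o:"$file.clean"
done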
Compare the XQuery equivalent, on the command line no less:
for file in *.html; do
  java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -s:"$file" \
    -qs:"/html/body//div[@id = 'uniquer']" -o:"$file.new"
done

(The backslash line continuation before -qs: is strictly a formatting convenience for this post.)
The files generated here will not be valid HTML.
Easy enough to fix with another round of Tidy.
After running Tidy, I was surprised to see a large number of very small files. Or at least I interpret 296 files of less than 1K in size to be very small files.
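You can reproduce that count with find (the c suffix counts bytes, avoiding find’s block rounding):

find . -maxdepth 1 -name '*.html' -size -1024c | wc -l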
I created a list of them, linked back to the Wikileaks originals (296 Under 1K Wikileaks CIA Vault 7 Hacking Tools Revealed Files) so you can verify that I captured the content reported by Wikileaks. Oh, and here are the files I generated as well: Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.
In case you are interested, boiling the 1134 files took them from 38.6 MB to 8.8 MB of actual content for indexing, searching, concordances, etc.
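If you want to check the shrinkage on your own copy, du totals it up; run it in the before and after directories:

du -ch *.html | tail -n 1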
Using the content-only files, tomorrow I will illustrate how you can correlate information across files. Stay tuned!