Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 5, 2018

Open Letter to NRCC Hackers

Filed under: Cybersecurity,Government,Hacking,Politics,Wikileaks — Patrick Durusau @ 11:04 am

We have never met or communicated but I wanted to congratulate you on the hack of top NRCC officials in 2018. Good show!

I’m sure you remember the drip-drip-drip release technique used by Wikileads with the Clinton emails. I had to check the dates but the first batch was in early October 2016, before the presidential election in November 2016.

The weekly release cycle, with the prior publicity concerning the leak, kept both alternative and mainstream media on the edge of climaxing every week. Even though the emails themselves were mostly office gossip and pettiness found in any office email system.

The most obvious target event for weekly drops of the NRCC emails is the 2020 election but that is subject to change.

Please consider the Wikileaks partial release tactic, which transformed office gossip into front-page news, when you select a target event for releasing the NRCC emails.

Your public service in damaging the NRCC will go unrewarded but not unappreciated. Once again, good show!

February 14, 2018

Wikileaks Has Sprung A Leak

Filed under: Journalism,News,Reporting,Wikileaks — Patrick Durusau @ 5:03 pm

In Leaked Chats, WikiLeaks Discusses Preference for GOP over Clinton, Russia, Trolling, and Feminists They Don’t Like by Micah Lee, Cora Currier.

From the post:

On a Thursday afternoon in November 2015, a light snow was falling outside the windows of the Ecuadorian embassy in London, despite the relatively warm weather, and Julian Assange was inside, sitting at his computer and pondering the upcoming 2016 presidential election in the United States.

In little more than a year, WikiLeaks would be engulfed in a scandal over how it came to publish internal emails that damaged Hillary Clinton’s presidential campaign, and the extent to which it worked with Russian hackers or Donald Trump’s campaign to do so. But in the fall of 2015, Trump was polling at less than 30 percent among Republican voters, neck-and-neck with neurosurgeon Ben Carson, and Assange spoke freely about why WikiLeaks wanted Clinton and the Democrats to lose the election.

“We believe it would be much better for GOP to win,” he typed into a private Twitter direct message group to an assortment of WikiLeaks’ most loyal supporters on Twitter. “Dems+Media+liberals woudl then form a block to reign in their worst qualities,” he wrote. “With Hillary in charge, GOP will be pushing for her worst qualities., dems+media+neoliberals will be mute.” He paused for two minutes before adding, “She’s a bright, well connected, sadistic sociopath.”

Like Wikileaks, the Intercept treats the public like rude children, publishing only what it considers to be newsworthy content:


The archive spans from May 2015 through November 2017 and includes over 11,000 messages, more than 10 percent of them written from the WikiLeaks account. With this article, The Intercept is publishing newsworthy excerpts from the leaked messages.

My criticism of the Intercept’s selective publication of leaks isn’t unique to its criticism of Wikileaks. I have voiced similar concerns about the ICIJ and Wikileaks itself.

I want to believe the Intercept, ICIJ and Wikileaks when they proclaim others have been lying, unfaithful, dishonest, etc.

But that wanting/desire makes it even more important that I critically assess the evidence they advance for their claims.

Selective release of evidence undermines their credibility to be no more than those they accuse.

BTW, if anyone has a journalism 101 guide to writing headlines, send a copy to the Intercept. They need it.

PS: I don’t have an opinion one way or the other on the substance of the Lee/Currier account. I’ve never been threatened with a government missile so can’t say how I would react. Badly I would assume.

April 13, 2017

CIA To Silence Wikileaks? Donate/Leak to Wikileaks

Filed under: CIA,Intelligence,Wikileaks — Patrick Durusau @ 8:03 pm

CIA chief targets WikiLeaks and Julian Assange as ‘hostile,’ vows to take action by Tim Johnson.

From the post:

CIA Director Mike Pompeo on Thursday called the anti-secrecy group WikiLeaks a hostile intelligence service and said the group would soon face decisive U.S. action to stifle its disclosures of leaked material.

“It ends now,” Pompeo said in his first public remarks after 10 weeks on the job, indicating that President Donald Trump will take undefined but forceful action.

Pompeo lashed out aggressively against Julian Assange, the Australian founder of WikiLeaks – who has been holed up in the Ecuadorean embassy in London for nearly five years – calling him a narcissist and “a fraud, a coward hiding behind a screen.”

Really?

Given the perennial failure of the CIA to discover terror attacks before they happen, recognize when governments are about to fall, and maintain their own security, I can’t imagine Assange and Wikileaks are shaking in their boots.

I disagree with Wikileaks on their style of leaking, I prefer faster and unedited leaking but that’s a question of style and not whether to leak.

If, and it’s a big if, Wikileaks is silenced, the world will grow suddenly darker. Much of what Wikileaks has published would not be published by main stream media, much to the detriment of citizens around the world.

Two things you need to do:

The easy one, donate to support WikiLeaks. As often as you can.

The harder one, leak secrets to Wikileaks.

Repressive governments are pressing WikiLeaks, help WikiLeaks make a fire hose of leaks to push them back.

April 7, 2017

Wikileaks Vault 7 “Grasshopper” – A Value Added Listing

Filed under: CIA,Government,Vault 7,Wikileaks — Patrick Durusau @ 1:29 pm

Wikileaks has released Vault 7 “Grasshopper.”

As I have come to expect the release:

  • Is in no particular order
  • Requires loading an HTML page before obtaining a PDF file

Here is a value-added listing that corrects both of those problems (and includes page numbers):

  1. GH-Drop-v1_0-UserGuide.pdf 2 pages
  2. GH-Module-Bermuda-v1_0-UserGuide.pdf 9 pages
  3. GH-Module-Buffalo-Bamboo-v1_0-UserGuide.pdf 7 pages
  4. GH-Module-Crab-v1_0-UserGuide.pdf 6 pages
  5. GH-Module-NetMan-v1_0-UserGuide.pdf 6 pages
  6. GH-Module-Null-v2_0-UserGuide.pdf 5 pages
  7. GH-Module-Scrub-v1_0-UserGuide.pdf 6 pages
  8. GH-Module-Wheat-v1_0-UserGuide.pdf 5 pages
  9. GH-Module-WUPS-v1_0-UserGuide.pdf 6 pages
  10. GH-Run-v1_0-UserGuide.pdf 2 pages
  11. GH-Run-v1_1-UserGuide.pdf 2 pages
  12. GH-ScheduledTask-v1_0-UserGuide.pdf 3 pages
  13. GH-ScheduledTask-v1_1-UserGuide.pdf 4 pages
  14. GH-ServiceDLL-v1_0-UserGuide.pdf 4 pages
  15. GH-ServiceDLL-v1_1-UserGuide.pdf 5 pages
  16. GH-ServiceDLL-v1_2-UserGuide.pdf 5 pages
  17. GH-ServiceDLL-v1_3-UserGuide.pdf 6 pages
  18. GH-ServiceProxy-v1_0-UserGuide.pdf 4 pages
  19. GH-ServiceProxy-v1_1-UserGuide.pdf 5 pages
  20. Grasshopper-v1_1-AdminGuide.pdf 107 pages
  21. Grasshopper-v1_1-UserGuide.pdf 53 pages
  22. Grasshopper-v2_0_1-UserGuide.pdf 134 pages
  23. Grasshopper-v2_0_2-UserGuide.pdf 134 pages
  24. Grasshopper-v2_0-UserGuide.pdf 134 pages
  25. IVVRR-Checklist-StolenGoods-2_0.pdf 2 pages
  26. StolenGoods-2_0-UserGuide.pdf 11 pages
  27. StolenGoods-2_1-UserGuide.pdf 22 pages

If you notice that the Grasshopper-*****-UserGuide.pdf appears in four different versions, good for you!

I suggest you read only Grasshopper-v2_0_2-UserGuide.pdf.

The differences between Grasshopper-v1_1-UserGuide.pdf at 53 pages and Grasshopper-v2_0-UserGuide.pdf at 134 pages, are substantial.

However, between Grasshopper-v2_0-UserGuide.pdf and Grasshopper-v2_0_1-UserGuide.pdf the only differences from Grasshopper-v2_0_2-UserGuide.pdf are these:

diff Grasshopper-v2_0-UserGuide.txt Grasshopper-v2_0_1-UserGuide.txt

4c4
< Grasshopper v2.0 
---
> Grasshopper v2.0.1 
386a387,389
> 
> Payloads arguments can be added with the optional -a parameter when adding a 
> payload component. 


diff Grasshopper-v2_0_1-UserGuide.txt Grasshopper-v2_0_2-UserGuide.txt

4c4
< Grasshopper v2.0.1 
---
> Grasshopper v2.0.2 
1832c1832
< winxppro-sp0 winxppro-sp1 winxppro-sp2 winxppro-sp3 
---
> winxp-x64-sp0 winxp-x64-sp1 winxp-x64-sp2 winxp-x64-sp3 
1846c1846
< winxppro win2003 
---
> winxp-x64 win2003 

Unless you are preparing a critical edition for the CIA and/or you are just exceptionally anal, the latest version, Grasshopper-v2_0_2-UserGuide.pdf, should be sufficient for most purposes.

Not to mention saving you 321 pages of duplicated reading.

Enjoy!

March 31, 2017

Wikileaks Marble – 676 Source Code Files – Would You Believe 295 Unique (Maybe)

Filed under: CIA,Cybersecurity,Wikileaks — Patrick Durusau @ 7:38 pm

Wikileaks released Marble Framework, described as:

Today, March 31st 2017, WikiLeaks releases Vault 7 “Marble” — 676 source code files for the CIA’s secret anti-forensic Marble Framework. Marble is used to hamper forensic investigators and anti-virus companies from attributing viruses, trojans and hacking attacks to the CIA.

Effective leaking doesn’t seem to have recommended itself to Wikileaks.

Marble-Framework-ls-lRS-devworks.txt, is an ls -lRS listing of the devworks directory.

After looking for duplicate files and starting this post, I discovered entirely duplicated directories:

Compare:

devutils/marbletester/props with devutils/marble/props.

devutils/marbletester/props/internal with devutils/marble/props/internal

devutils/marbleextensionbuilds/Marble/Deobfuscators with devutils/marble/Shared/Deobfuscators

That totals to 182 entirely duplicated files.

In Marble-Framework-ls-lRS-devworks-annotated.txt I separated files on the basis of file size. Groups of duplicate files are separated from other files with a blank line and headed by the number of duplicate copies.

I marked only exact file size matches as duplicates, even though files close in size could be the result of insignificant whitespace.

After removing the entirely duplicated directories, there remain 199 duplicate files.

With 182 files in entirely duplicated directories and 199 remaining duplicates brings us to a grand total of 381 duplicate files.

Or the quicker way to say it: Vault 7 Marble — 295 unique source code files for the CIA’s secret anti-forensic Marble Framework.

Wikileaks may be leaking the material just as it was received. But that’s very poor use of your time and resources.

Leak publishers should polish leaks until they have a fire-hardened point.

March 24, 2017

Other Methods for Boiling Vault 7: CIA Hacking Tools Revealed?

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 2:58 pm

You may have other methods for boiling content out of the Wikileaks Vault 7: CIA Hacking Tools Revealed.

To that end, here is the list of deduped files.

Warning: The Wikileaks pages I have encountered are so malformed that repair will always be necessary before using XQuery.

Enjoy!

Efficient Querying of Vault 7: CIA Hacking Tools Revealed

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 1:42 pm

This week we have covered:

  1. Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1) Eliminated duplication and editorial artifacts, 1134 HTML files out of 7809 remain.
  2. Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2 – The PDF Files) Eliminated public and placeholder documents, 114 arguably CIA files remain.
  3. CIA Documents or Reports of CIA Documents? Vault7 All of the HTML files are reports of possibly CIA material but we do know HTML file != CIA document.
  4. Boiling Reports of CIA Documents (Wikileaks CIA Vault 7 CIA Hacking Tools Revealed) The HTML files contain a large amount of cruft, which can be extracted using XQuery and common tools.

Interesting, from a certain point of view, but aside from highlighting bloated leaking from Wikileaks, why should anyone care?

Good question!

Let’s compare the de-duped but raw with the de-duped but boiled document set.

De-duped but raw document set:

De-duped and boiled document set:

In raw count, boiling took us from 2,131,135 words/tokens to 665,202 words/tokens.

Here’s a question for busy reporters/researchers:

Has the CIA compromised the Tor network?

In the raw files, Tor occurs 22660 times.

In the boiled files, Tor occurs 4 times.

Here’s a screen shot of the occurrences:

With TextSTAT, select the occurrence in the concordance and another select (mouse click to non-specialists) takes you to:

In a matter of seconds, you can answer as far as the HTML documents of Vault7 Part1 show, the CIA is talking about top of rack (ToR), a switching architecture for networks. Not, the Tor Project.

What other questions do you want to pose to the boiled Vault 7: CIA Hacking Tools Revealed document set?

Tooling up for efficient queries

First, you need: Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.

Second, grab a copy of: TextSTAT – Simple Text Analysis Tool runs on Windows, GNU/Linux and MacOS. (free)

When you first open TextSTAT, it will invite you to create a copora.

The barrel icon to the far left creates a new corpora. Select it and:

Once you save the new corpora, this reminder about encodings pops up:

I haven’t explored loading Windows files while on a Linux box but will and report back. Interesting to see inclusion of PDF. Something we need to explore after we choose which of the 124 possibly CIA PDF files to import.

Finally, you are at the point of navigating to where you have stored the unzipped Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files:

Select the first file, scroll to the end of the list, press shift and select the last file. Then choose OK. It takes a minute or so to load but it has a progress bar to let you know it is working.

Observations on TextSTAT

As far as I can tell, TextSTAT doesn’t use the traditional stop list of words but enables you to set of maximum and minimum occurrences in the Word Form window. Along with wildcards as well. More flexible than the old stop list practice.

BTW, the context right/left on the Concordance window refers to characters, not words/tokens. Another departure from my experience with concordances. Not a criticism, just an observation of something that puzzled me at first.

Conclusion

The benefit of making secret information available, a la Wikileaks cannot be over-stated.

But making secret information available isn’t the same as making it accessible.

Investigators, reporters, researchers, the oft-mentioned public, all benefit from accessible information.

Next week look for a review of the probably CIA PDF files to see which ones I would incorporate into the corpora. (You may include more or less.)

PS: I’m looking for telecommuting work, editing, research (see this blog), patrick@durusau.net.

March 23, 2017

Boiling Reports of CIA Documents (Wikileaks CIA Vault 7 CIA Hacking Tools Revealed)

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 7:35 pm

Before you read today’s installment on the Wikileaks CIA Vault 7 CIA Hacking Tools Revealed, you should check out the latest drop from Wikileaks: CIA Vault 7 Dark Matter. Five documents and they all look interesting.

I started with a fresh copy of the HTML files in a completely new directory and ran Tidy first, plus fixed:

page_26345506.html:<declarations><string name="½ö"></string></declarations><p>›<br>

which I described in: Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1).

So with a clean and well-formed set of files, I modified the XQuery to collect all of the references to prior versions. Reasoning that any file that was a prior version, we can ditch that, leaving only the latest files.


for $doc in collection('collection.xml')//a[matches(.,'^\d+$')]
return ($doc/@href/string(), ' ')

Unlike the count function we used before, this returns the value of the href attribute and appends a new line, after each one.

I saved that listing to priors.txt and then (your experience may vary on this next step):

xargs rm < priors.txt

WARNING: If your file names have spaces in them, you may delete files unintentionally. My data had no such spaces so this works in this case.

Once I had the set of files without those representing “previous versions,” I’m down to the expected 1134.

That’s still a fair number of files and there is a lot of cruft in them.

For variety I did look at XSLT, but these are some empty XSLT template statements needed to clean these files:

<xsl:template match="style"/>

<xsl:template match="//div[@id = 'submit_wlkey']" />

<xsl:template match="//div[@id = 'submit_help_contact']" />

<xsl:template match="//div[@id = 'submit_help_tor']" />

<xsl:template match="//div[@id = 'submit_help_tips']" />

<xsl:template match="//div[@id = 'submit_help_after']" />

<xsl:template match="//div[@id = 'submit']" />

<xsl:template match="//div[@id = 'submit_help_buttons']" />

<xsl:template match="//div[@id = 'efm-button']" />

<xsl:template match="//div[@id = 'top-navigation']" />

<xsl:template match="//div[@id = 'menu']" />

<xsl:template match="//footer" />

<xsl:template match="//script" />

<xsl:template match="//li[@class = 'comment']" />

Compare the XQuery query, on the command line no less:

for file in *.html; do
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -s:"$file"
-qs:"/html/body//div[@id = 'uniquer']" -o:"$file.new"
done

(The line break in front of -qs: is strictly for formatting for this post.)

The files generated here will not be valid HTML.

Easy enough to fix with another round of Tidy.

After running Tidy, I was surprised to see a large number of very small files. Or at least I interpret 296 files of less than 1K in size to be very small files.

I created a list of them, linked back to the Wikileaks originals (296 Under 1K Wikileaks CIA Vault 7 Hacking Tools Revealed Files) so you can verify that I capture the content reported by Wikileaks. Oh, and here are the files I generated as well, Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.

In case you are interested, boiling the 1134 files took them from 38.6 MB to 8.8 MB of actual content for indexing, searching, concordances, etc.

Using the content only files, tomorrow I will illustrate how you can correlate information across files. Stay tuned!

March 22, 2017

CIA Documents or Reports of CIA Documents? Vault7

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 3:17 pm

As I tool up to analyze the 1134 non-duplicate/artifact HTML files in Vault 7: CIA Hacking Tools Revealed, it occurred to me those aren’t “CIA documents.”

Take Weeping Angel (Extending) Engineering Notes as an example.

Caveat: My range of experience with “CIA documents” is limited to those obtained by Michael Best and others using Freedom of Information Act requests. But that should be sufficient to identify “CIA documents.”

Some things I notice about Weeping Angel (Extending) Engineering Notes:

  1. A Wikileaks header with donation button.
  2. “Vault 7: CIA Hacking Tools Revealed”
  3. Wikileaks navigation
  4. reported text
  5. More Wikileaks navigation
  6. Ads for Wikileaks, Tor, Tails, Courage, bitcoin

I’m going to say that the 1134 non-duplicate/artifact HTML files in Vault7, Part1, are reports of portions (which portions is unknown) of some unknown number of CIA documents.

A distinction that influences searching, indexing, concordances, word frequency, just to name a few.

What I need is the reported text, minus:

  1. A Wikileaks header with donation button.
  2. “Vault 7: CIA Hacking Tools Revealed”
  3. Wikileaks navigation
  4. More Wikileaks navigation
  5. Ads for Wikileaks, Tor, Tails, Courage, bitcoin

Check in tomorrow when I boil 1134 reports of CIA documents to get something better suited for text analysis.

March 21, 2017

Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2 – The PDF Files)

Filed under: Leaks,Vault 7,Wikileaks — Patrick Durusau @ 4:21 pm

You may want to read Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1) before reading this post. In Part 1, I walk you through obtaining a copy of Wikileaks’ Vault 7: CIA Hacking Tools Revealed so you can follow and check my analysis and conclusions.

Fact checking applies to every source, including this blog.

I proofed my listing of the 357 PDF files in the first Vault 7 release and report an increase in arguably CIA files and a slight decline in public documents. An increase from 114 to 125 for the CIA and a decrease from 109 to 98 for public documents.

  1. Arguably CIA – 125
  2. Public – 98
  3. Wikileaks placeholders – 134

The listings to date:

  1. CIA (maybe)
  2. Public documents
  3. Wikileaks placeholders

For public documents, I created hyperlinks whenever possible. (Saying a fact and having evidence are two different things.) Vendor documentation that was not marked with a security classification I counted as public.

All I can say for the Wikileaks placeholders, some 134 of them, is to ignore them unless you like mining low grade ore.

I created notes in the CIA listing to help narrow your focus down to highly relevant materials.

I have more commentary in the works but I wanted to release these listings in case they help others make efficient use of their time.

Enjoy!

PS: A question I want to start addressing this week is how the dilution of a leak impacts the use of same?

March 20, 2017

Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1)

Filed under: CIA,Leaks,Vault 7,Wikileaks — Patrick Durusau @ 10:29 am

Executive Summary:

If you reported Vault 7: CIA Hacking Tools Revealed as containing:

8,761 documents and files from an isolated, high-security network situated inside the CIA’s Center for Cyber Intelligence in Langley, Virgina…. (Vault 7: CIA Hacking Tools Revealed)

you failed to check your facts.

I detail my process below but in terms of numbers:

  1. Of 7809 HTML files, 6675 are duplicates or Wikileaks artifacts
  2. Of 357 PDF files, 134 are Wikileaks artifacts (for materials not released). Of the remaining 223 PDF files, 109 of them are public information, the GNU Make Manual for instance. Out of the 357 pdf files, Wikileaks has delivered 114 arguably from the CIA and some of those are doubtful. (Part 2, forthcoming)

Wikileaks haters will find little solace here. My criticisms of Wikileaks are for padding the leak and not enabling effective use of the leak. Padding the leak is obvious from the inclusion of numerous duplicate and irrelevant documents. Effective use of the leak is impaired by the padding but also by teases of what could have been released but wasn’t.

Getting Started

To start on common ground, fire up a torrent client, obtain and decompress: Wikileaks-Year-Zero-2017-v1.7z.torrent.

Decompression requires this password: SplinterItIntoAThousandPiecesAndScatterItIntoTheWinds

The root directory is year0.

When I run a recursive ls from above that directory:

ls -R year0 | wc -l

My system reports: 8820

Change to the year0 directory and ls reveals:

bootstrap/ css/ highlighter/ IMG/ localhost:6081@ static/ vault7/

Checking the files in vault7:

ls -R vault7/ | wc -l

returns: 8755

Change to the vault7 directory and ls shows:

cms/ files/ index.html logo.png

The directory files/ has only one file, org-chart.png. An organization chart of the CIA but with sub-departments are listed with acronyms and “???.” Did the author of the chart not know the names of those departments? I point that out as the first of many file anomalies.

Some 7809 HTML files are found under cms/.

The cms/ directory has a sub-directory files, plus main.css and 7809 HTML files (including the index.html file).

Duplicated HTML Files

I discovered duplication of the HTML files quite by accident. I had prepared the files with Tidy for parsing with Saxon and compressed a set of those files for uploading.

The 7808 files I compressed started at 296.7 MB.

The compressed size, using 7z, was approximately 3.5 MB.

That’s almost 2 order of magnitude of compression. 7z is good, but it’s not quite that good. 😉

Checking my file compression numbers

You don’t have to take my word for the file compression experience. If you select all the page_*, space_* and user_* HTML files in a file browser, it should report a total size of 296.7 MB.

Create a sub-directory to year0/vault7/cms/, say mkdir htmlfiles and then:

cp *.html htmlfiles

Then: cd htmlfiles

and,

7z a compressedhtml.7z *.html

Run: ls -l compressedhtml.7z

Result: 3488727 Mar 16 16:31 compressedhtml.7z

Tom Harris, in How File Compression Works, explains that:


Most types of computer files are fairly redundant — they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.

If you don’t agree the HTML file are highly repetitive, check the next section where one source of duplication is demonstrated.

Demonstrating Duplication of HTML files

Let’s start with the same file as we look for a source of duplication. Load Cinnamon Cisco881 Testing at Wikileaks into your browser.

Scroll to near the bottom of the file where you see:

Yes! There are 136 prior versions of this alleged CIA file in the directory.

Cinnamon Cisco881 Testinghas the most prior versions but all of them have prior versions.

Are we now in agreement that duplicated versions of the HTML pages exist in the year0/vault7/cms/ directory?

Good!

Now we need to count how many duplicated files there are in year0/vault7/cms/.

Counting Prior Versions of the HTML Files

You may or may not have noticed but every reference to a prior version takes the form:

<a href=”filename.html”>integer</a*gt;

That going to be an important fact but let’s clean up the HTML so we can process it with XQuery/Saxon.

Preparing for XQuery

Before we start crunching the HTML files, let’s clean them up with Tidy.

Here’s my Tidy config file:

output-xml: yes
quote-nbsp: no
show-warnings: no
show-info: no
quiet: yes
write-back: yes

In htmlfiles I run:

tidy -config tidy.config *.html

Tidy reports two errors:


line 887 column 1 - Error: is not recognized!
line 887 column 15 - Error: is not recognized!

Grepping for “declarations>”:

grep "declarations" *.html

Returns:

page_26345506.html:<declarations><string name="½ö"></string></declarations><p>›<br>

The string element is present as well so we open up the file and repair it with XML comments:

<!-- <declarations><string name="½ö"></string></declarations><p>›<br> -->
<!-- prior line commented out to avoid Tidy error, pld 14 March 2017-->

Rerun Tidy:

tidy -config tidy.config *.html

Now Tidy returns no errors.

XQuery Finds Prior Versions

Our files are ready to be queried but 7809 is a lot of files.

There are a number of solutions but a simple one is to create an XML collection of the documents and run our XQuery statements across the files as a set.

Here’s how I created a collection file for these files:

I did an ls in the directory and piped that to collection.xml. Opening the file I deleted index.html, started each entry with <doc href=" and ended each one with "/>, inserted <collection> before the first entry and </collection> after the last entry and then saved the file.

Your version should look something like:

<collection>
  <doc href="page_10158081.html"/>
  <doc href="page_10158088.html"/>
  <doc href="page_10452995.html"/>
...
  <doc href="user_7995631.html"/>
  <doc href="user_8650754.html"/>
  <doc href="user_9535837.html"/>
</collection>

The prior versions in Cinnamon Cisco881 Testing from Wikileaks, have this appearance in HTML source:

<h3>Previous versions:</h3>
<p>| <a href=”page_17760540.html”>1</a> <span class=”pg-tag”><i>empty</i></span>
| <a href=”page_17760578.html”>2</a> <span class=”pg-tag”></span>

…..

| <a href=”page_23134323.html”>135</a> <span class=”pg-tag”>[Xetron]</span>
| <a href=”page_23134377.html”>136</a> <span class=”pg-tag”>[Xetron]</span>
|</p>
</div>

You will need to spend some time with the files (I have obviously) to satisfy yourself that <a> elements that contain only numbers are exclusively used for prior references. If you come across any counter-examples, I would be very interested to hear about them.

To get a file count on all the prior references, I used:

let $count := count(collection('collection.xml')//a[matches(.,'^\d+$')])
return $count

Run that script to find: 6514 previous editions of the base files

Unpacking the XQuery

Rest assured that’s not how I wrote the first XQuery on this data set! 😉

Without exploring all the by-ways and alleys I traversed, I will unpack that query.

First, the goal of the query is to identify every <a> element that only contains digits. Recalling that previous versions link have digits only in their <a> elements.

A shout out to Jonathan Robie, Editor of XQuery, for reminding me that string expressions match substrings unless they are have beginning and ending line anchors. Here:

'^\d+$'

The \d matches only digits, the + enables matching 1 or more digits, and the beginning ^ and ending $ eliminate any <a> elements that might start with one or more digits, but also contains text. Like links to files, etc.

Expanding out a bit more, [matches(.,'^\d+$')], the [ ] enclose a predicate that consist of the matches function, which takes two arguments. The . here represents the content of an <a> element, followed by a comma as a separator and then the regex that provides the pattern to match against.

Although talked about as a “code smell,” the //a in //a[matches(.,'^\d+$')] enables us to pick up the <a> elements wherever they are located. We did have to repair these HTML files and I don’t want to spend time debugging ephemeral HTML.

Almost there! The collection file, along with the collection function, collection('collection.xml') enables us to apply the XQuery to all the files listed in the collection file.

Finally, we surround all of the foregoing with the count function: count(collection('collection.xml')//a[matches(.,'^\d+$')]) and declare a variable to capture the result of the count function: let $count :=

So far so good? I know, tedious for XQuery jocks but not all news reporters are XQuery jocks, at least not yet!

Then we produce the results: return $count.

But 6514 files aren’t 6675 files, you said 6675 files

Yes, your right! Thanks for paying attention!

I said at the top, 6675 are duplicates or Wikileaks artifacts.

Where are the others?

If you look at User #71477, which has the file name, user_40828931.html, you will find it’s not a CIA document but part of Wikileaks administration for these documents. There are 90 such pages.

If you look at Marble Framework, which has the file name, space_15204359.html, you find it’s a CIA document but a form of indexing created by Wikileaks. There are 70 such pages.

Don’t forget the index.html page.

When added together, 6514 (duplicates), 90 (user pages), 70 (space pages), index.html, I get 6675 duplicates or Wikileaks artifacts.

What’s your total?


Tomorrow:

In Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2), I look under year0/vault7/cms/files to discover:

  1. Arguably CIA files (maybe) – 114
  2. Public documents – 109
  3. Wikileaks artifacts – 134

I say “Arguably CIA” because there are file artifacts and anomalies that warrant your attention in evaluating those files.

March 9, 2017

How Bad Is Wikileaks Vault7 (CIA) HTML?

Filed under: HTML,Wikileaks,WWW,XQuery — Patrick Durusau @ 8:29 pm

How bad?

Unless you want to hand correct 7809 html files to use with XQuery, grab the latest copy of Tidy

It’s not the worst HTML I have ever seen, but put that in the context of having seen a lot of really poor HTML.

I’ve “tidied” up a test collection and will grab a fresh copy of the files before producing and releasing a clean set of the HTML files.

Producing a document collection for XQuery processing. Working towards something suitable for application of NLP and other tools.

March 8, 2017

That CIA exploit list in full: … [highlights]

Filed under: CIA,Cybersecurity,Government,Privacy,Security,Wikileaks — Patrick Durusau @ 5:58 pm

That CIA exploit list in full: The good, the bad, and the very ugly by Iain Thomson.

From the post:

We’re still going through the 8,761 CIA documents published on Tuesday by WikiLeaks for political mischief, although here are some of the highlights.

First, though, a few general points: one, there’s very little here that should shock you. The CIA is a spying organization, after all, and, yes, it spies on people.

Two, unlike the NSA, the CIA isn’t mad keen on blanket surveillance: it targets particular people, and the hacking tools revealed by WikiLeaks are designed to monitor specific persons of interest. For example, you may have seen headlines about the CIA hacking Samsung TVs. As we previously mentioned, that involves breaking into someone’s house and physically reprogramming the telly with a USB stick. If the CIA wants to bug you, it will bug you one way or another, smart telly or no smart telly. You’ll probably be tricked into opening a dodgy attachment or download.

That’s actually a silver lining to all this: end-to-end encrypted apps, such as Signal and WhatsApp, are so strong, the CIA has to compromise your handset, TV or computer to read your messages and snoop on your webcam and microphones, if you’re unlucky enough to be a target. Hacking devices this way is fraught with risk and cost, so only highly valuable targets will be attacked. The vast, vast majority of us are not walking around with CIA malware lurking in our pockets, laptop bags, and living rooms.

Thirdly, if you’ve been following US politics and WikiLeaks’ mischievous role in the rise of Donald Trump, you may have clocked that Tuesday’s dump was engineered to help the President pin the hacking of his political opponents’ email server on the CIA. The leaked documents suggest the agency can disguise its operations as the work of a foreign government. Thus, it wasn’t the Russians who broke into the Democrats’ computers and, by leaking the emails, helped swing Donald the election – it was the CIA all along, Trump can now claim. That’ll shut the intelligence community up. The President’s pet news outlet Breitbart is already running that line.

Iain does a good job of picking out some of the more interesting bits from the CIA (alleged) file dump. No, you will have to read Iain’s post for those.

I mention Iain’s post primarily as a way to entice you into reading the all the files in hopes of discovering more juicy tidbits.

Read the files. Your security depends on the indifference of the CIA and similar agencies. Is that your model for privacy?

March 7, 2017

Vault 7: CIA Hacking Tools In Bulk Download

Filed under: Cybersecurity,Security,Wikileaks — Patrick Durusau @ 5:43 pm

If you want to avoid mirroring Vault 7: CIA Hacking Tools Revealed for yourself, check out: https://archive.org/details/wikileaks.vault7part1.tar.

Why Wikileaks doesn’t offer bulk access to its data sets, you would have to ask Wikileaks.

Enjoy!

Wikileaks Armed – You’re Not

Filed under: Cybersecurity,Government,Wikileaks — Patrick Durusau @ 9:33 am

Vault 7: CIA Hacking Tools Revealed (Wikileaks).

Very excited to read:

Today, Tuesday 7 March 2017, WikiLeaks begins its new series of leaks on the U.S. Central Intelligence Agency. Code-named “Vault 7” by WikiLeaks, it is the largest ever publication of confidential documents on the agency.

The first full part of the series, “Year Zero”, comprises 8,761 documents and files from an isolated, high-security network situated inside the CIA’s Center for Cyber Intelligence in Langley, Virgina. It follows an introductory disclosure last month of CIA targeting French political parties and candidates in the lead up to the 2012 presidential election.

Recently, the CIA lost control of the majority of its hacking arsenal including malware, viruses, trojans, weaponized “zero day” exploits, malware remote control systems and associated documentation. This extraordinary collection, which amounts to more than several hundred million lines of code, gives its possessor the entire hacking capacity of the CIA. The archive appears to have been circulated among former U.S. government hackers and contractors in an unauthorized manner, one of whom has provided WikiLeaks with portions of the archive.

Very disappointed to read:


Wikileaks has carefully reviewed the “Year Zero” disclosure and published substantive CIA documentation while avoiding the distribution of ‘armed’ cyberweapons until a consensus emerges on the technical and political nature of the CIA’s program and how such ‘weapons’ should analyzed, disarmed and published.

Wikileaks has also decided to redact and anonymise some identifying information in “Year Zero” for in depth analysis. These redactions include ten of thousands of CIA targets and attack machines throughout Latin America, Europe and the United States. While we are aware of the imperfect results of any approach chosen, we remain committed to our publishing model and note that the quantity of published pages in “Vault 7” part one (“Year Zero”) already eclipses the total number of pages published over the first three years of the Edward Snowden NSA leaks.

For all of the fretting over the “…extreme proliferation risk in the development of cyber ‘weapons’…”, bottom line is Wikileaks and its agents are armed with CIA cyber weapons and you are not.

Assange/Wikileaks have cast their vote in favor of arming themselves and protecting the CIA and others.

Responsible leaking of cyber weapons means arming everyone equally.

January 6, 2017

Online Database of “Verified” Twitter Accounts (Right On!)

Filed under: Twitter,Wikileaks — Patrick Durusau @ 4:12 pm

The WikiLeaks Task Force tweeted on 6 Jan. 2017:

We are thinking of making an online database with all “verified” twitter accounts & their family/job/financial/housing relationships.

There are a number of comments to this tweet, the ones containing “dox,” “doxx,” “doxing,” “creepy,” “evil,” etc. that should be ignored.

Ignored because intelligence agencies, news organizations, merchants, banks, etc. are all collecting and organizing that data and more.

Ignored because the public should not preemptively disarm itself.

If anything, the Wikileaks Task Force should start with “verified” Twitter accounts and expand outwards, rapidly.

The public should be able to rapidly find relationships of individuals nominated for office, who contribute money to candidates, who profit from contracts, who launder public money. The public should have the same advantages intelligence agencies enjoy today.

To the nay-sayers to the WikiLeaks Task Force proposal:

Why do you seek to prevent putting the public on a better footing vis-a-vis government?

Question to my readers: What do the nay-sayers gain from a disarmed public?

November 7, 2016

Drip, Drip, Drip, Leaking At Wikileaks

Filed under: Hillary Clinton,Politics,Wikileaks — Patrick Durusau @ 3:58 pm

wikileaks-dnc-460

Two days before the U.S. Presidential election, Wikileaks released 8,200 emails from the Democratic National Committee (DNC). Which were in addition to its daily drip, drip, drip leaking of emails from John Podesta, Hillary Clinton’s campaign chair.

The New York Times, a sometimes collaborator with Wikileaks (The War Logs (NYT)), has sponsored a series of disorderly and nearly incoherent attacks on Wikileaks for these leaks.

The dominant theme in those attacks is that readers should not worry their shallow and insecure minds about social media but rely upon media outlets to clearly state any truth readers need to know.

I am not exaggerating. The exact language that appears in one such attack was:

…people rarely act like rational, civic-minded automatons. Instead, we are roiled by preconceptions and biases, and we usually do what feels easiest — we gorge on information that confirms our ideas, and we shun what does not.

Is that how you think of yourself? It is how the New York Times thinks about you.

There are legitimate criticisms concerning Wikileaks and its drip, drip, drip leaking but the Times manages to miss all of them.

For example, the daily drops of Podesta emails, selected on some “unknown to the public” criteria, prevented the creation of a coherent narrative by reporters and the public. The next day’s leak might contain some critical link, or not.

Reporters, curators and the public were teased with drips and drabs of information, which served to drive traffic to the Wikileaks site, traffic that serves no public interest.

If that sounds ungenerous, consider that as the game draws to a close, that Wikileaks has finally posted a link to the Podesta emails in bulk: https://file.wikileaks.org/file/podesta-emails/.

To be sure, some use has been made of the Podesta emails, my work and that of others on DKIM signatures (unacknowledged by Wikileaks when it “featured” such verification on email webpages), graphs, etc. but early bulk release of the emails would have enabled much more.

For example:

  • Concordances of the emails and merging those with other sources
  • Connecting the dots to public or known only to others data
  • Entity recognition and linking in extant resources and news stories
  • Fitting the events, people, places into coherent narratives
  • Sentiment analysis
  • etc.

All of that lost because of the “Wikileaks look at me” strategy for releasing the Podesta emails.

I applaud Wikileaks obtaining and leaking data, including the Podesta emails, but, a look at me strategy impairs the full exploration and use of leaked data.

Is that really the goal of Wikileaks?

PS: If you are interested in leaking without games or redaction, ping me. I’m interested in helping with such leaks.

November 5, 2016

Freedom of Speech/Press – Great For “Us” – Not So Much For You (Wikileaks)

Filed under: Free Speech,Politics,Social Media,Wikileaks — Patrick Durusau @ 8:33 pm

The New York Times, sensing a possible defeat of its neo-liberal agenda on November 8, 2016, has loosed the dogs of war on social media in general and Wikileaks in particular.

Consider the sleight of hand in Farhad Manjoo’s How the Internet Is Loosening Our Grip on the Truth, which argues on one hand,


You’re Not Rational

The root of the problem with online news is something that initially sounds great: We have a lot more media to choose from.

In the last 20 years, the internet has overrun your morning paper and evening newscast with a smorgasbord of information sources, from well-funded online magazines to muckraking fact-checkers to the three guys in your country club whose Facebook group claims proof that Hillary Clinton and Donald J. Trump are really the same person.

A wider variety of news sources was supposed to be the bulwark of a rational age — “the marketplace of ideas,” the boosters called it.

But that’s not how any of this works. Psychologists and other social scientists have repeatedly shown that when confronted with diverse information choices, people rarely act like rational, civic-minded automatons. Instead, we are roiled by preconceptions and biases, and we usually do what feels easiest — we gorge on information that confirms our ideas, and we shun what does not.

This dynamic becomes especially problematic in a news landscape of near-infinite choice. Whether navigating Facebook, Google or The New York Times’s smartphone app, you are given ultimate control — if you see something you don’t like, you can easily tap away to something more pleasing. Then we all share what we found with our like-minded social networks, creating closed-off, shoulder-patting circles online.

This gets to the deeper problem: We all tend to filter documentary evidence through our own biases. Researchers have shown that two people with differing points of view can look at the same picture, video or document and come away with strikingly different ideas about what it shows.

You caught the invocation of authority by Manjoo, “researchers have shown,” etc.

But did you notice he never shows his other hand?

If the public is so bat-shit crazy that it takes all social media content as equally trustworthy, what are we to do?

Well, that is the question isn’t it?

Manjoo invokes “dozens of news outlets” who are tirelessly but hopelessly fact checking on our behalf in his conclusion.

The strong implication is that without the help of “media outlets,” you are a bundle of preconceptions and biases doing what feels easiest.

“News outlets,” on the other hand, are free from those limitations.

You bet.

If you thought Majoo was bad, enjoy seething through Zeynep Tufekci’s claims that Wikileaks is an opponent of privacy, sponsor of censorship and opponent of democracy, all in a little over 1,000 words (1069 exact count). Wikileaks Isn’t Whistleblowing.

It’s a breath taking piece of half-truths.

For example, playing for your sympathy, Tufekci invokes the need of dissidents for privacy. Even to the point of invoking the ghost of the former Soviet Union.

Tufekci overlooks and hopes you do as well, that these emails weren’t from dissidents, but from people who traded in and on the whims and caprices at the pinnacles of American power.

Perhaps realizing that is too transparent a ploy, she recounts other data dumps by Wikileaks to which she objects. As lawyers say, if the facts are against you, pound on the table.

In an echo of Manjoo, did you know you are too dumb to distinguish critical information from trivial?

Tufekci writes:


These hacks also function as a form of censorship. Once, censorship worked by blocking crucial pieces of information. In this era of information overload, censorship works by drowning us in too much undifferentiated information, crippling our ability to focus. These dumps, combined with the news media’s obsession with campaign trivia and gossip, have resulted in whistle-drowning, rather than whistle-blowing: In a sea of so many whistles blowing so loud, we cannot hear a single one.

I don’t think you are that dumb.

Do you?

But who will save us? You can guess Tufekci’s answer, but here it is in full:


Journalism ethics have to transition from the time of information scarcity to the current realities of information glut and privacy invasion. For example, obsessively reporting on internal campaign discussions about strategy from the (long ago) primary, in the last month of a general election against a different opponent, is not responsible journalism. Out-of-context emails from WikiLeaks have fueled viral misinformation on social media. Journalists should focus on the few important revelations, but also help debunk false misinformation that is proliferating on social media.

If you weren’t frightened into agreement by the end of her parade of horrors:


We can’t shrug off these dangers just because these hackers have, so far, largely made relatively powerful people and groups their targets. Their true target is the health of our democracy.

So now Wikileaks is gunning for democracy?

You bet. 😉

Journalists of my youth, think Vietnam, Watergate, were aggressive critics of government and the powerful. The Panama Papers project is evidence that level of journalism still exists.

Instead of whining about releases by Wikileaks and others, journalists* need to step up and provide context they see as lacking.

It would sure beat the hell out of repeating news releases from military commanders, “justice” department mouthpieces, and official but “unofficial” leaks from the American intelligence community.

* Like any generalization this is grossly unfair to the many journalists who work on behalf of the public everyday but lack the megaphone of the government lapdog New York Times. To those journalists and only them, do I apologize in advance for any offense given. The rest of you, take such offense as is appropriate.

Clinton/Podesta Map (through #30)

Filed under: Data Mining,Hillary Clinton,Politics,Wikileaks — Patrick Durusau @ 7:26 pm

Charlie Grapski created Navigating Wikileaks: A Guide to the Podesta Emails.

podesta-map-grapski-460

The listing take 365 pages to date so this is just a tiny sample image.

I don’t have a legend for the row coloring but have tweeted to Charlie about the same.

Enjoy!

November 2, 2016

Wikileaks Podesta Docs Proven To Be False

Filed under: Government,Wikileaks — Patrick Durusau @ 7:01 pm

Glenn Greenwald tweeted this list of all the Podesta Docs from Wikileaks that have proven to be false:

https://t.co/3QAb3LLxn0

Journalists should keep that in mind when judging contested facts between Wikileaks and government sources.

Yes?

October 22, 2016

Validating Wikileaks Emails [Just The Facts]

Filed under: Cybersecurity,Hillary Clinton,Journalism,News,Reporting,Wikileaks — Patrick Durusau @ 8:27 pm

A factual basis for reporting on alleged “doctored” or “falsified” emails from Wikileaks has emerged.

Now to see if the organizations and individuals responsible for repeating those allegations, some 260,000 times, will put their doubts to the test.

You know where my money is riding.

If you want to verify the Podesta emails or other email leaks from Wikileaks, consult the following resources.

Yes, we can validate the Wikileaks emails by Robert Graham.

From the post:

Recently, WikiLeaks has released emails from Democrats. Many have repeatedly claimed that some of these emails are fake or have been modified, that there’s no way to validate each and every one of them as being true. Actually, there is, using a mechanism called DKIM.

DKIM is a system designed to stop spam. It works by verifying the sender of the email. Moreover, as a side effect, it verifies that the email has not been altered.

Hillary’s team uses “hillaryclinton.com”, which as DKIM enabled. Thus, we can verify whether some of these emails are true.

Recently, in response to a leaked email suggesting Donna Brazile gave Hillary’s team early access to debate questions, she defended herself by suggesting the email had been “doctored” or “falsified”. That’s not true. We can use DKIM to verify it.

Bob walks you through validating a raw email from Wikileaks with the DKIM verifier plugin for Thunderbird. And demonstrating the same process can detect “doctored” or “falsified” emails.

Bob concludes:

I was just listening to ABC News about this story. It repeated Democrat talking points that the WikiLeaks emails weren’t validated. That’s a lie. This email in particular has been validated. I just did it, and shown you how you can validate it, too.

Btw, if you can forge an email that validates correctly as I’ve shown, I’ll give you 1-bitcoin. It’s the easiest way of solving arguments whether this really validates the email — if somebody tells you this blogpost is invalid, then tell them they can earn about $600 (current value of BTC) proving it. Otherwise, no.

BTW, Bob also points to:

Here’s Cryptographic Proof That Donna Brazile Is Wrong, WikiLeaks Emails Are Real by Luke Rosiak, which includes this Python code to verify the emails:

clinton-python-email-460

and,

Verifying Wikileaks DKIM-Signatures by teknotus, offers this manual approach for testing the signatures:

clinton-sig-check-460

But those are all one-off methods and there are thousands of emails.

But the post by teknotus goes on:

Preliminary results

I only got signature validation on some of the emails I tested initially but this doesn’t necessarily invalidate them as invisible changes to make them display correctly on different machines done automatically by browsers could be enough to break the signatures. Not all messages are signed. Etc. Many of the messages that failed were stuff like advertising where nobody would have incentive to break the signatures, so I think I can safely assume my test isn’t perfect. I decided at this point to try to validate as many messages as I could so that people researching these emails have any reference point to start from. Rather than download messages from wikileaks one at a time I found someone had already done that for the Podesta emails, and uploaded zip files to Archive.org.

Emails 1-4160
Emails 4161-5360
Emails 5361-7241
Emails 7242-9077
Emails 9078-11107

It only took me about 5 minutes to download all of them. Writing a script to test all of them was pretty straightforward. The program dkimverify just calls a python function to test a message. The tricky part is providing context, and making the results easy to search.

Automated testing of thousands of messages

It’s up on Github

It’s main output is a spreadsheet with test results, and some metadata from the message being tested. Results Spreadsheet 1.5 Megs

It has some significant bugs at the moment. For example Unicode isn’t properly converted, and spreadsheet programs think the Unicode bits are formulas. I also had to trap a bunch of exceptions to keep the program from crashing.

Warning: I have difficulty opening the verify.xlsx file. In Calc, Excel and in a CSV converter. Teknotus reports it opens in LibreOffice Calc, which just failed to install on an older Ubuntu distribution. Sing out if you can successfully open the file.

Journalists: Are you going to validate Podesta emails that you cite? Or that others claim are false/modified?

October 19, 2016

The Podesta Emails [In Bulk]

Filed under: Government,Hillary Clinton,Searching,Topic Maps,Wikileaks — Patrick Durusau @ 7:53 pm

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:

podesta-basic-search-460

and, advanced:

podesta-adv-search-460

As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!

Enjoy!

October 15, 2016

Why Journalists Should Not Rely On Wikileaks Indexing – Podesta Emails

Filed under: Journalism,News,Reporting,Wikileaks — Patrick Durusau @ 3:58 pm

Clinton on Fracking, or, Another Reason to Avoid Wikileaks Indexing

fracking-podesta-460

The quote in the tweet is false.

Politico supplies the correct quotation in its post:


“Bernie Sanders is getting lots of support from the most radical environmentalists because he’s out there every day bashing the Keystone pipeline. And, you know, I’m not into it for that,” Clinton told the unions, according to the transcript. “My view is, I want to defend natural gas. … I want to defend fracking under the right circumstances.”

I’m guessing that “…under the right circumstances.” must have pushed Wikileaks too close to the 140 character barrier.

Ditto for the Wikileaks mis-quote of: “Get a life.”

Which reported as in the tweet, appears to refer to unbridled fracking.

Not so in the Politico post:


“I’m already at odds with the most organized and wildest” of the environmental movement, Clinton told building trades unions in September 2015, according to a transcript of the remarks apparently circulated by her aides. “They come to my rallies and they yell at me and, you know, all the rest of it. They say, ‘Will you promise never to take any fossil fuels out of the earth ever again?’ No. I won’t promise that. Get a life, you know.”

Doesn’t read quite the same way does it?

I supposed once you start lying it’s really hard to stop. Clinton is a good example of that and Wikileaks should not follow her example.

It’s hard to spot these lies because Wikileaks isn’t indexing the attachments.

You can search all day for “defend fracking,” “get a life” (by Clinton) and you will come up empty (at least as of today).

So that you don’t have to search for: 20150909 Transcript | Building Trades Union (Keystone XL) at Wikileaks – Podesta Emails, I have produced a PDF version of that attachment, Building-Trades-Union-Clinton-Sept-09-2015.pdf (my naming), for your viewing pleasure.

August 24, 2016

How to Navigate Wikileak Torrents (wlstorage.net)?

Filed under: Wikileaks — Patrick Durusau @ 2:10 pm

How to Navigate Wikileak Torrents (wlstorage.net)?. A query I posted earlier today at Open Data on Stack Exchange.

From the query:

I can download Wikileak files from either wlstorage.net or file.wikileaks.org but I’m having difficulty identifying the files of interest.

For example, at http://wikileaks.org, you see “DNC Email Archive,” and “AKP Email Archive,” but I have been unable to match those with any entry for the Wikileaks archives. Dates don’t help because the archives all list as 01-Jan-1984.

Am I missing a well known mapping file to the archives? Thanks!

A mapping from common names for collections to the archives would be a very useful thing.

Pointers? Suggestions?

August 17, 2016

Double Standards At NPR

Filed under: Government,Journalism,News,Reporting,Wikileaks — Patrick Durusau @ 4:00 pm

NPR Host Demands That Assange Do Something Its Own Reporters Are Told Never to Do by Naomi LaChance.

From the post:

In a ten-minute interview aired Wednesday morning, NPR’s David Greene asked Wikileaks founder Julian Assange five times to reveal the sources of the leaked information he has published on the internet.

A major tenet of American journalism is that reporters protect their sources. Wikileaks is certainly not a traditional news organization, but Greene’s persistent attempts to get Assange to violate confidentiality was alarming, especially considering that there has been no challenge to the authenticity of the material in question.

NPR (National Public Radio) shows its true colors, not as a free and independent press but as a lackey of the Democratic Party in this interview with Assange.

David Greene (Morning Edition) was fixated on repeating the unconfirmed reports that the Russians (which Russians no one every says), were behind the leak of DNC emails.

You can read the transcript of Assange/Greene interview for yourself.

Greene never asks one substantive question about the 20,000 emails. Not one. The first leak of its kind and all Greene does is whine about rumors of Russian involvement.

Well, that’s not entirely fair, Greene does have this exchange with Assange:


GREENE: Well, let me – apart from the different investigations, could you see people in the U.S. government thinking that you might be a threat to national security?

ASSANGE: Well, I mean, there’s great people in the U.S. government – many of them are our sources – and there’s terrible people in the U.S. government. Unfortunately, the U.S. government is a – you know, a reflection, to some degree, of the rest of society. So it’s filled with its share of paranoid and sociopathic power climbers…

GREENE: But is it paranoid to look at these uncensored documents?

ASSANGE: …People who make errors of judgment, etc.

GREENE: Is it paranoid to look at these uncensored documents, these emails, that are released by you? And if they believe that that could change a U.S. presidential election, could be a threat to national security, why isn’t it logical…

ASSANGE: I just – I mean…

GREENE: …For them to see you as a possible threat?

Hmmm, telling the truth about DNC emails can be a threat to national security?

What a bizarre concept in a democracy! Disclosure of evidence of manipulation of the democratic process is a “…threat to national security?”

NPR can and should do better than David Greene shilling for the Democratic Party.

August 16, 2016

WikiLeaks AKP dump contains 80 types of malware (!OutLook)

Filed under: Cybersecurity,Wikileaks — Patrick Durusau @ 7:34 pm

WikiLeaks AKP dump contains 80 types of malware by Nicky Cappella.

From the post:

The latest WikiLeaks AKP email contains more than 80 types of malware, an independent researcher has confirmed. The malware includes ransomware and remote-access trojans.

WikiLeaks released emails from the Turkish political party AKP in two parts: one in July, and one on August 5. Anti-virus and malware expert Vesselin Bontchev reviewed the content of those emails and published his findings on his GitHub page. Bontchev listed more than 200 individual emails that contain a link to a confirmed malicious attachment.

His report shows a link to infected emails on the WikiLeaks site, the URL for the malware attachment within the email, and a link to a VirusTotal page, showing the way that different anti-virus scanners are reporting the malware. The URL to the malicious attachment has been made unclickable by substituting ‘hxxxxx’ for ‘https’, as the URL listed is a direct link to the malware and a click would result in an immediate download.

A word to the wise I suppose.

You weren’t going to look at a stolen email archive using OutLook were you?

August 4, 2016

Joel Simon (@Joelcpj): Woodward and Bernstein Not “Ethical and Committed” Journalists

Filed under: Government,Journalism,News,Reporting,Wikileaks — Patrick Durusau @ 10:01 am

Joel Simon‘s opinion piece How journalists can cover leaks without helping spies, leaves you with the conclusion that Woodward and Bernstein (Watergate) were not “ethical and committed” journalists.

Skipping the nationalistic ranting and “compelling evidence,” which turns out to be the New York Times parroting surmises and guesses by known liars (U.S. intelligence community), Simon writes of the Wikileaks dump of DNC emails:


As for WikiLeaks, by publishing a data dump without verifying the source or providing its readers with the context to make informed decisions about the motivations of the leakers, it is allowing itself to be a vehicle for governments like Russia that are weaponizing information and using it to achieve policy objectives. Ethical and committed journalists should do all within their power to ensure they are never put in such a position. (emphasis added)

For more than thirty years, 1972 – 2005, the Watergate source known as “Deep Throat (W. Mark Felt),” and his motives, remained a mystery to the American public.

Yet, his revelations were instrumental in bringing down an American president (Richard Nixon).

Mark Felt was a friend of Bob Woodward and their meeting in a parking garage on October 9th, 1972, lead to the October 10, 1972 Washington Post story titled: FBI Finds Nixon Aides Sabotaged Democrats.

In case you don’t remember, 1972 was a presidential election year, with the election being held on November 7, 1972.

Consider those three dates, the discussion between Bernstein and Felt (October 9, 1972), the Washington Post story (October 10, 1972) and the presidential election (November 7, 1972). Or perhaps better:


October 9, 1972 – 29 days until voting begins in presidential election

October 10, 1972 – 28 days until voting begins in presidential election

November 7, 1972 (election day)

The timing of the leak and its publication by the Washington Post less than thirty (30) days prior to a presidential election certainly make the motives of the leaker a relevant question.

Yet, Deep Throat remained unknown and “…readers with[out] the context to make informed decisions about the motivations of the [Deep Throat/Mark Felt]…” for more than thirty years.

Contrary to Joel Simon’s criteria, Woodward and Bernstein verified and corroborated the information given to them by Deep Throat/Mark Felt to be truthful and did not explore for their readers, any possible motivations on his part.

The authenticity of the DNC emails has not been challenged and resignations of Wasserman Schultz (DNC Chair), Amy Dacey (DNC CEO), Brad Marshall (DNC CFO), Luis Miranda (DNC Communications Director) and an public apology to Bernie Sanders by the Democratic National Committee, are all supporting evidence that the DNC email leak is both accurate and authentic.

Unlike Joel Simon, I think Woodward and Bernstein were “ethical and committed” journalists during Watergate, providing their readers with accurate information in a timely manner.

Without exploring the motives of why someone would leak truthful information.

The CJR, Joel Simon and the media generally should abandon its attempt to twist journalistic ethics to exclude publication of truthful information of legitimate interest to a voting public.

Judging from the tone of Simon’s post, his concerns are driven more by rabid nationalism and jingoism than any legitimate concern for journalistic ethics.

July 23, 2016

Wikileaks Mentions In DNC Email – .000718%. Hillary To/From Emails – .000000% (RDON)

Filed under: Politics,Wikileaks — Patrick Durusau @ 4:30 pm

Cryptome tweeted today:

wikileaks-dnc-460

Would you believe that Hillary Clinton is more irrelevant than Wikileaks?

Consider the evidence:

Search for hillaryclinton.com at Search the DNC email database

Scrape the 533 results, as of Saturday, 23 July 2016, into a file.

Grep for hillaryclinton.com and pipe that to another file.

Clean out the remaining markup, insert line returns for commas in cc: field, lowercase and sort, then uniq.

Results:

  1. aelrod@hillaryclinton.com – Adrienne K. Elrod
  2. creynolds@hillaryclinton.com – never a sender
  3. dcheng@hillaryclinton.com – Dennis Cheng
  4. djtspeaks@hillaryclinton.com – never a sender
  5. jklein@hillaryclinton.com – Justin Klein
  6. jschwerin@hillaryclinton.com – Josh Schwerin
  7. kgasperine@hillaryclinton.com – Kathleen Gasperine
  8. lroitman@hillaryclinton.com – Lindsay Roitman
  9. mhalle@hillaryclinton.com – never a sender
  10. mjennings@hillaryclinton.com – Mary Rutherford Jennings
  11. press@hillaryclinton.com – no author
  12. tvclips@hillaryclinton.com – 1 post, no sig
  13. zpetkanas@hillaryclinton.com – Zac Petkanas

That’s right! From January of 2015 until May of 2016, Hillary Clinton apparently had no emails to or from the DNC.

I find that to be unlikely to say the least.

What’s your explanation for the absence of Hillary Clinton emails to and from the DNC?

My explanation that Wikileaks is manipulating both the data and all of us.

Here’s a motto for data leaks: Raw Data Or Nothing (RDON)

Say it, repeat it, demand it – RDON!

July 22, 2016

Yes Luis, There Is A Fuck You Emoji

Filed under: Politics,Wikileaks — Patrick Durusau @ 8:29 pm

Luis Miranda, Communications Director of the DNC asks:

fuck-you-emoji-460

Yes, there is a Fuck You emoji!

For example, here is the Google version:

google-fuck-you

I don’t know if Luis is still looking for an answer to that question but if so, consider it answered!

Searching the DNC email database can be amusing, even educational as the question from Luis demonstrates, I would prefer the ability to browse and to download the dataset for deeper analysis.

What have you found in the DNC email database?

July 5, 2016

Hillary Clinton Email Archive

Filed under: Government,Wikileaks — Patrick Durusau @ 1:33 pm

Hillary Clinton Email Archive by Wikileaks.

From the webpage:

On March 16, 2016 WikiLeaks launched a searchable archive for 30,322 emails & email attachments sent to and from Hillary Clinton’s private email server while she was Secretary of State. The 50,547 pages of documents span from 30 June 2010 to 12 August 2014. 7,570 of the documents were sent by Hillary Clinton. The emails were made available in the form of thousands of PDFs by the US State Department as a result of a Freedom of Information Act request. The final PDFs were made available on February 29, 2016.

“Truthers” may be interested in this searchable archive of Clinton’s emails while Secretary of State.

“Truthers” because the FBI’s recommendation of no charges effectively ends this particular approach to derail Clinton’s run for the presidency.

Many wish the result were different but when the last strike is called, arguing about it isn’t going to change the score of the game.

New evidence and new facts, on the other hand, are unknown factors and could make a difference whereas old emails will not.

Are you going to be looking for new evidence and facts or crying over calls in a game already lost?

Older Posts »

Powered by WordPress