Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 3, 2018

Distributed Denial of Secrets (#DDoSecrets) – There’s a New Censor in Town

Filed under: Censorship,CIA,Leaks,NSA — Patrick Durusau @ 6:59 pm

Distributed Denial of Secrets (#DDoSecrets) (ddosecretspzwfy7.onion/)

From a tweet by @NatSecGeek:

Distributed Denial of Secrets (#DDoSecrets), a collective/distribution system for leaked and hacked data, launches today with over 1 TB of data from our back catalogue (more TK).

Great right? Well, maybe not so great:

Our goal is to preserve info and ensure its available to those who need it. When possible, we will distribute complete datasets to everyone. In some instances, we will offer limited distribution due to PII or other sensitive info. #DDoSecrets currently has ~15 LIMDIS releases.

As we’re able, #DDoSecrets will produce sanitized versions of these datasets for public distribution. People who can demonstrate good cause for a copy of the complete dataset will be provided with it.

Raphael Satter, in Leak site’s launch shows dilemma of radical transparency, documents the sad act of self-mutilation (self-censorship) by #DDoSecrets.

Hosting the Ashley Madison hack drew criticism from Joseph Cox (of Motherboard) and Gabriella Coleman (McGill University anthropologist). The Ashley Madison data is already available for searching (by email, for example, at https://ashley.cynic.al/), so the harm of a bulk release isn’t clear.

What is clear is the reasoning of Coleman:


Best said the data would now be made available to researchers privately on a case-by-case basis, a decision that mollified some critics.

“Much better,” said Coleman after reviewing the newly pared-back site. “Exactly the model we might want.”

I am not surprised this is the model Coleman wants; academics are legendary for treating access as a privilege, thus empowering themselves to sit in judgment on others.

Let me explicitly say that I have no doubt Emma Best will be as even-handed with such judgments as anyone.

But once we concede any basis for censorship, the withholding of information of any type, then we are cast into a darkness from which there is no escape. A censor claims to have withheld only X, but how are we to judge? We have no access to the original data. Only its mutilated, bastard child.

Emma Best is likely the least intrusive censor you can find but what is your response when the CIA or the NSA makes the same claim?

Censorship is a danger when practiced by anyone for any reason.

Support and leak to the project but always condition deposits on raw leaking by #DDoSecrets.

September 6, 2018

Guidance for Leakers

Filed under: Journalism,Leaks,News,Reporting — Patrick Durusau @ 2:19 pm

Our who, what, why leak explainer by Hamish Boland-Rudder.

From the post:

Whistleblowers, like Deep Throat, Daniel Ellsberg, Karen Silkwood, Mordechai Vanunu, Linda Tripp, Jeffrey Wigand, Edward Snowden, Bradley — now Chelsea — Manning and John Doe, come from all walks of life, and stigma and myth tend to surround them.

The International Consortium of Investigative Journalists has lots of experience with information leaks. In the past five years alone, we’ve sifted through about 30 million leaked documents to produce groundbreaking investigations like the Panama Papers, Paradise Papers, Swiss Leaks and Lux Leaks.

The common denominator? Whistleblowers providing information, secretly, in an attempt to expose hidden wrongs.

Famously, whistleblowers have toppled President Richard Nixon, effectively ended the Vietnam War, exposed an Oval Office tryst, revealed nuclear secrets, uncovered environmental and health catastrophes and focused global attention on offshore tax havens.

ICIJ is often approached by concerned citizens who believe they’ve found an injustice that they’d like us to investigate, but few know the first thing about becoming a whistleblower or how to provide information to journalists.

So we thought a basic guide to leaking might prove useful, one laid out using an old journalistic tool: the five W’s and a H (loosely interpreted!)

I deeply respect the work the International Consortium of Investigative Journalists (ICIJ) has done in the past, is doing in the present and will continue to do in the future. Amazing work that has made a real difference for millions of ordinary people around the world.

On the other hand, I have been, am and will be highly critical of the ICIJ over its hoarding of leaks for limited groups of reporters and its editing of those leaks, a display of paternalism toward readers who haven’t asked for its help.

All that said, do pass this information from the ICIJ around. You never know where the next leaker may be found.

PS: I would not target anyone in government with the material. Better to send everyone in the EPA the same advice, so no one stands out. Same for other government agencies. You’re a citizen: write to your government.

August 2, 2018

Leaking 4Julian – Non-Sysadmin Leaking

Filed under: .Net,Journalism,Leaks,News,Reporting — Patrick Durusau @ 6:15 pm

Non-sysadmins read username: 4julian password: $etJulianFree!2Day and wish they could open corporate or government archives up to mining.

Don’t despair! Even non-sysadmins can participate in the Assange Data Tsunami, worldwide leaking of data in the event of the arrest of Julian Assange.

Check out the Whistle Blower FAQ – International Consortium of Investigative Journalists (ICIJ) by Gerald Ryle.

FYI, by some unspecified criteria, the ICIJ decides which individuals and groups mentioned in a leak merit public exposure and which do not. This is a universal practice among journalists. Avoiding it requires non-journalist outlets.

The ICIJ does a great job with leaks, but if I were going to deprive a government or corporation of power over information, why would I empower journalists to make the same tell/don’t-tell decision? Let the public decide what to make of all the information, assisted by the efforts of journalists, but not with some information being known only to the journalists.

From the FAQ:

‘What information should I include?’ and other frequently asked questions about becoming a whistleblower

In my 30-year career as a journalist, I’ve spoken with thousands of potential sources, some of them with interesting tips or insider knowledge, others with massive datasets to share. Conversations often start with questions about the basics of whistleblowing. If you’re thinking about leaking information, here are some of the things you should keep in mind:

Q. What is a whistleblower?

A whistleblower is someone who has evidence of wrongdoing, abuse of power, fraud or misconduct and who shares it with a third party such as an investigative journalism organization like the International Consortium of Investigative Journalists.

By blowing the whistle you can help prevent the possible escalation of misconduct or corruption.

Edward Snowden is one of the world’s best-known Whistleblowers.

Q. Can a whistleblower remain anonymous?

Yes. We will always go out of our way to protect whistleblowers. You can remain anonymous for as long as you want, and, in fact, this is sometimes the best protection that journalists can offer whistleblowers.

Q. What information should I include?

To enable a thorough investigation, you should include a detailed description of the issue you are concerned about. Ideally, you should also include documents or data. The more information you provide, the better the work the journalists can do.

I need to write something up on “raw leaking,” that is, leaking without using a journalist. Look for that early next week!

January 12, 2018

Leaking Resources for Federal Employees with Ties to ‘Shithole’ Countries

Filed under: Journalism,Leaks,News,Reporting — Patrick Durusau @ 10:58 am

Trump derides protections for immigrants from ‘shithole’ countries by Josh Dawsey.

From the post:

President Trump grew frustrated with lawmakers Thursday in the Oval Office when they discussed protecting immigrants from Haiti, El Salvador and African countries as part of a bipartisan immigration deal, according to several people briefed on the meeting.

“Why are we having all these people from shithole countries come here?” Trump said, according to these people, referring to countries mentioned by the lawmakers.

The EEOC Annual report for 2014 reports out of 2.7 million women and men employed by the federal government:

…63.50% were White, 18.75% were Black or African American 8.50% were Hispanic or Latino, 6.16% were Asian, 1.49% were American Indian or Alaska Native, 1.16% were persons of Two or More Races and 0.45% were Native Hawaiian or Other Pacific Islander…(emphasis added)

In other words, 27.25% of the 2.7 million people working for the federal government, or approximately 736,000 federal employees, have ties to ‘shithole’ countries.

President Trump’s rude remarks are an accurate reflection of current U.S. immigration policy:

The United States treats other countries as ‘shitholes,’ but it is considered impolite to mention that in public.

Federal employees with ties to ‘shithole’ countries are at least as loyal as, if not more loyal than, your average staffer.

That said, I’m disappointed that media outlets did not immediately call upon federal employees with ties to ‘shithole’ countries to start leaking documents/data.

For places documents can be leaked to, see Here’s how to share sensitive leaks with the press and its excellent listing of SecureDrop resources for anonymous submission of documents.

If you have heard of the Panama Papers or the Paradise Papers, then you are thinking about the International Consortium of Investigative Journalists. They do excellent work but, like the other journalists mentioned, are obsessed with controlling the distribution of your leak.

Every outrage, whether a shooting, an unjust imprisonment, racist remarks, or religious bigotry, is an opportunity to incite leaking by members of a group.

Not calling for leaking speaks volumes about your commitment to the status quo and its current injustices.

October 8, 2017

OnionShare – Safely Sharing Email Leaks – 394 Days To Mid-terms

Filed under: Cybersecurity,Email,Government,Journalism,Leaks,News,Reporting — Patrick Durusau @ 4:43 pm

FiveThirtyEight concludes Clinton’s leaked emails had some impact on the 2016 presidential election, but can’t say how much. How Much Did WikiLeaks Hurt Hillary Clinton?

Had the leaked emails been less boring and inconsequential, “smoking gun” sorts of emails, their impact could have been substantial.

The lesson being the impact of campaign/candidate/party emails is impossible to judge until they have been leaked. Even then the impact may be uncertain.

“Leaked emails” presumes someone has leaked the emails, which in light of the 2016 presidential election, is a near certainty for the 2018 congressional mid-term elections.

Should you find yourself in possession of leaked emails, you may want a way to share them with others. My preference is for public posting without edits or deletions, but not everyone shares my confidence in the public.

One way to share files securely and anonymously with specific people is OnionShare.

From the wiki page:

What is OnionShare?

OnionShare lets you securely and anonymously share files of any size. It works by starting a web server, making it accessible as a Tor onion service, and generating an unguessable URL to access and download the files. It doesn’t require setting up a server on the internet somewhere or using a third party filesharing service. You host the file on your own computer and use a Tor onion service to make it temporarily accessible over the internet. The other user just needs to use Tor Browser to download the file from you.

How to Use

http://asxmi4q6i7pajg2b.onion/egg-cain. This is the secret URL that can be used to download the file you’re sharing.

Send this URL to the person you’re sending the files to. If the files you’re sending aren’t secret, you can use normal means of sending the URL, like by emailing it, or sending it in a Facebook or Twitter private message. If you’re sending secret files then it’s important to send this URL securely.

The person who is receiving the files doesn’t need OnionShare. All they need is to open the URL you send them in Tor Browser to be able to download the file.
(emphasis in original)

Download OnionShare 1.1. Versions are available for Windows, Mac OS X, with instructions for Ubuntu, Fedora and other flavors of Linux.

Caveat: If you are sending a secret URL to leaked emails or other leaked data, use ordinary mail: no return address; a standard envelope from a package you then discard; the URL spelled out on the back of a blank counter deposit slip with letters cut from a newspaper, taped in the correct order; sent to the intended recipient. (No licking; it leaves trace DNA.)

Those are the obvious security points about delivering a secret URL. Take that as a starting point.
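One further, hedged suggestion, my addition rather than anything from the OnionShare documentation: compute a checksum of the file before sharing it, so the recipient can confirm the download arrived intact and unaltered. A minimal sketch, assuming sha256sum is available (the file name is made up):

```shell
# Hash the file before sharing; the recipient runs the same
# command on the downloaded copy and compares the two hashes.
printf 'example leaked data\n' > leak.zip
sha256sum leak.zip
```

If the two hashes match, the file was neither corrupted nor tampered with in transit.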

PS: I would never contact the person chosen for sharing about the shared emails. The emails can be verified separately and apart from you as the source. Every additional contact puts you in increased danger of becoming part of a public story. What they don’t know, they can’t tell.

September 19, 2017

An Honest Soul At The W3C? EME/DRM Secret Ballot

Filed under: Cybersecurity,DRM,Electronic Frontier Foundation,Leaks,Security,W3C — Patrick Durusau @ 9:49 am

Billions of current and future web users have been assaulted and robbed in what Jeff Jaffe (W3C CEO) calls a “respectful debate.” Reflections on the EME Debate.

Odd sense of “respectful debate.”

A robber demands all of your money and clothes, promises to rent you clothes to get home, but won’t tell you how to make your own clothes. You are now and forever a captive of the robber. (That’s a lay person’s summary, but an accurate account of what the EME crowd wanted and got.)

Representatives for potential victims, the EFF and others, pointed out the problems with EME at length, over years of debate. The response of the robbers: “We want what we want.”

Consistently, for years, the simple-minded response of EME advocates continued to be: “We want what we want.”

If you think I’m being unkind to the EME advocates, consider the language of the Disposition of Comments for Encrypted Media Extensions and Director’s decision itself:


Given that there was strong support to initially charter this work (without any mention of a covenant) and continued support to successfully provide a specification that meets the technical requirements that were presented, the Director did not feel it appropriate that the request for a covenant from a minority of Members should block the work the Working Group did to develop the specification that they were chartered to develop. Accordingly the Director overruled these objections.

The EME lacks a covenant protecting researchers and others from anti-circumvention laws, one that would enable continued research on security and other aspects of EME implementations.

That covenant was not in the original charter, hence the director’s “(without any mention of a covenant),” aka, “We want what we want.”

There wasn’t ever any “respectful debate,” but rather EME supporters repeating over and over again, “We want what we want.”

A position which prevailed, which brings me to the subject of this post. A vote, a secret vote, was conducted by the W3C seeking support for the Director’s cowardly and self-interested support for EME, the result of which has been reported as:


Though some have disagreed with W3C’s decision to take EME to recommendation, the W3C determined that the hundreds of millions of users who want to watch videos on the Web, some of which have copyright protection requirements from their creators, should be able to do so safely and in a Web-friendly way. In a vote by Members of the W3C ending mid September, 108 supported the Director’s decision to advance EME to W3C Recommendation that was appealed mid-July through the appeal process, while 57 opposed it and 20 abstained. Read about reflections on the EME debate, in a Blog post by W3C CEO Jeff Jaffe.

(W3C Publishes Encrypted Media Extensions (EME) as a W3C Recommendation)

One hundred and eight members took up the cry of “We want what we want” to rob billions of current and future web users. The only open question is: who are they?

To answer that question, the identity of these robbers, I posted this note to Jeff Jaffe:

Jeff,

I read:

***

In a vote by Members of the W3C ending mid September, 108 supported the Director’s decision to advance EME to W3C Recommendation that was appealed mid-July through the appeal process, while 57 opposed it and 20 abstained.

***

at: https://www.w3.org/2017/09/pressrelease-eme-recommendation.html.en

But I can’t seem to find a link to the vote details, that is a list of members and their vote/abstention.

Can you point me to that link?

Thanks!

Hope you are having a great week!

Patrick

It didn’t take long for Jeff to respond:

On 9/19/2017 9:38 AM, Patrick Durusau wrote:
> Jeff,
>
> I read:
>
> ***
>
> In a vote by Members of the W3C ending mid September, 108 supported the
> Director’s decision to advance EME to W3C Recommendation that was
> appealed mid-July through the appeal process, while 57 opposed it and 20
> abstained.
>
> ***
>
> at: https://www.w3.org/2017/09/pressrelease-eme-recommendation.html.en
>
> But I can’t seem to find a link to the vote details, that is a list of
> members and their vote/abstention.
>
> Can you point me to that link?

It is long-standing process not to release individual vote details publicly.

I wonder about a “long-standing process” for the only vote on an appeal in W3C history, but there you have it: the list of robbers isn’t public. No need to search the W3C website for it.

If there is an honest person at the W3C, a person who stands with the billions of victims of this blatant robbery, then we will see a leak of the EME vote.

If there is no leak of the EME vote, that is a self-comment on the staff of the W3C.

Yes?

PS: Kudos to the EFF and others for delaying EME this long but the outcome was never seriously in question. Especially in organizations where continued membership and funding are more important than the rights of individuals.

EME can only be defeated by action in the trenches as it were, depriving its advocates of any perceived benefit and imposing ever higher costs upon them.

You do have your marker pens and sticky tape ready. Yes?

September 14, 2017

Equifax: Theft Versus Sale Increases Your Risk?

Filed under: Cybersecurity,Leaks — Patrick Durusau @ 3:32 pm

Hyperventilating reports about Equifax leak:

Credit reporting firm Equifax says data breach could potentially affect 143 million US consumers

Why the Equifax breach is very possibly the worst leak of personal info ever

The Equifax Breach Exposes America’s Identity Crisis

fail to mention that Equifax was selling access to all 143 million stolen credit reports.

Had the hackers, may their skills be blessed, purchased access to the same 143 million credit reports, not a word of alarm would appear in any press report.

Isn’t that odd? You can legally purchase access to “personal identity data” but if you steal it, the foundations of a credit society are threatened.

Equifax doesn’t prevent purchase/use of its data by known criminal organizations; consider Wells Fargo and its 2.1 million fake accounts, a number that now totals 3.5 million (oops, 1.4 million accounts were overlooked).

Can you see a difference between a stolen credit report and one purchased by Wells Fargo? Or any other entity with paid access to Equifax data?

Another question, can you identify people employed by the DHS, FBI, CIA, NSA, etc. from the Equifax data?

PS: Before you lose too much sleep over theft of data already for sale, in this case Equifax credit reports, consider: How Bad Is the Equifax Hack? by Josephine Wolff.

September 13, 2017

New Anti-Leak Program (leaked of course)

Filed under: Journalism,Leaks,News,Reporting — Patrick Durusau @ 9:22 pm

Trump Administration Launches Broad New Anti-Leak Program by Chris Geidner.

From the post:

The top US national security official has directed government departments and agencies to warn employees across the entire federal government next week about the dangers and consequences of leaking even unclassified information.

The Trump administration has already promised an aggressive crackdown on anyone who leaks classified information. The latest move is a dramatic step that could greatly expand what type of leaks are under scrutiny and who will be scrutinized.

In the memo about leaks that was subsequently obtained by BuzzFeed News, National Security Adviser H.R. McMaster details a request that “every Federal Government department and agency” hold a one-hour training next week on “unauthorized disclosures” — of classified and certain unclassified information.

I’m guessing that since BuzzFeed News got the memo leaked before next week, this didn’t count as a leak under the new anti-leak program. Yes?

If “next week” means ending 24 September 2017, then leaks on or after 25 September 2017 count as leaks under the new program.

Journalists should include “leak” in all stories based on leaked information to assist researchers in tracking the rate of leaking under the new anti-leaking program.

Every leak is a step towards transparency and accountability.

June 20, 2017

Manning Leaks — No Real Harm (Database of Government Liars Anyone?)

Filed under: Government,Government Data,Leaks — Patrick Durusau @ 2:56 pm

Secret Government Report: Chelsea Manning Leaks Caused No Real Harm by Jason Leopold.

From the post:

In the seven years since WikiLeaks published the largest leak of classified documents in history, the federal government has said they caused enormous damage to national security.

But a secret, 107-page report, prepared by a Department of Defense task force and newly obtained by BuzzFeed News, tells a starkly different story: It says the disclosures were largely insignificant and did not cause any real harm to US interests.

Regarding the hundreds of thousands of Iraq-related military documents and State Department cables provided by the Army private Chelsea Manning, the report assessed “with high confidence that disclosure of the Iraq data set will have no direct personal impact on current and former U.S. leadership in Iraq.”

The 107-page report, redacted, runs 35 pages. Thanks to BuzzFeed News for prying that much of a semblance of the truth out of the government.

It is further proof that US prosecutors and other federal government representatives lie to the courts, the press and the public, whenever it suits their purposes.

Anyone with transcripts from the original Manning hearings should identify statements by prosecutors at variance with this report, noting the prosecutor’s name and rank and recording the page/line reference in the transcript.

That individual prosecutors and federal law enforcement witnesses lie is a commonly known fact. What I haven’t seen, is a central repository of all such liars and the lies they have told.

I mention a central repository because saying one or two prosecutors have lied, or have been called down by a judge, grabs a headline; but showing a pattern of lying by the state over decades could move matters to an entirely different level.

Judges, even conservative ones (especially conservative ones?), don’t appreciate being lied to by anyone, including the state.

The state has chosen lying as its default mode of operation.

Let’s help them wear that banner.

Interested?

June 9, 2017

Real Talk on Reality (Knowledge Gap on Leaking)

Filed under: Cybersecurity,Government Data,Leaks,NSA — Patrick Durusau @ 8:32 pm

Real Talk on Reality : Leaking is high risk by the grugq.

From the post:

On June 5th The Intercept released an article based on an anonymously leaked Top Secret NSA document. The article was about one aspect of the Russian cyber campaign against the 2016 US election — the targeting of election device manufacturers. The relevance of this aspect of the Russian operation is not exactly clear, but we’ll address that in a separate post because… just hours after The Intercept’s article went live the US Department of Justice released an affidavit (and search warrant) covering the arrest of Reality Winner — the alleged leaker. Let’s look at that!

You could teach a short course on leaking from this one post but there is one “meta” issue that merits your attention.

The failures of Reality Winner and the Intercept signal that users need educating in the art of information leaking.

With widespread tracking of web browsers, training on information leaking needs to be pushed to users. It would stand out if one member of the military requested and was sent an email lesson on leaking. An email that went to everyone in a particular command, not so much.

Public Service Announcements (PSAs) in web zines, as ads, etc. with only the barest of tips, is another mechanism to consider.

If you are very creative, perhaps “Mr. Bill” claymation episodes with one principle of leaking each? They’d need to be funny enough that viewing/sharing isn’t suspicious.

Other suggestions?

May 2, 2017

Practical Suggestions For Improving Transparency

Filed under: Government,Journalism,Leaks,News,Reporting — Patrick Durusau @ 2:50 pm

A crowd wail about Presidents Obama, Trump, opacity, lack of transparency, loss of democracy, freedom of the press, the imminent death of civilization, etc., isn’t going to improve transparency.

I have two practical suggestions for improving transparency.

First suggestion: Always re-post, tweet, share stories with links to leaked materials. If the story you read doesn’t have such a link, seek out one that does to re-post, tweet, share.

Some stories of leaks include a URL to the leaked material, like Hacker leaks Orange is the New Black new season after ransom demands ignored by Sean Gallagher, or NSA-leaking Shadow Brokers just dumped its most damaging release yet by Dan Goodin, both of Ars Technica.

Some stories of the same leaks do not include a URL to the leaked material: The Netflix ‘Orange is the New Black’ Leak Shows TV Piracy Is So 2012 (which does have the best strategy for fighting piracy I have ever read) or Shadow Brokers leak trove of NSA hacking tools.

Second suggestion: If you encounter leaked materials, post, tweet and share them as widely as possible. (Translations are always needed.)

Improving transparency requires only internet access and the initiative to do so.

Are you game?

March 24, 2017

Other Methods for Boiling Vault 7: CIA Hacking Tools Revealed?

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 2:58 pm

You may have other methods for boiling content out of the Wikileaks Vault 7: CIA Hacking Tools Revealed.

To that end, here is the list of deduped files.

Warning: The Wikileaks pages I have encountered are so malformed that repair will always be necessary before using XQuery.

Enjoy!

Efficient Querying of Vault 7: CIA Hacking Tools Revealed

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 1:42 pm

This week we have covered:

  1. Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1) Eliminated duplication and editorial artifacts, 1134 HTML files out of 7809 remain.
  2. Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2 – The PDF Files) Eliminated public and placeholder documents, 114 arguably CIA files remain.
  3. CIA Documents or Reports of CIA Documents? Vault7 All of the HTML files are reports of possibly CIA material but we do know HTML file != CIA document.
  4. Boiling Reports of CIA Documents (Wikileaks CIA Vault 7 CIA Hacking Tools Revealed) The HTML files contain a large amount of cruft, which can be extracted using XQuery and common tools.
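The duplicate elimination in Part 1 used XQuery, but byte-identical duplicates can also be spotted with ordinary shell tools by checksumming each file. A minimal sketch; the demo files here are my own, not Vault 7 content:

```shell
# Files that share a checksum are byte-identical duplicates.
# sort groups identical hashes; uniq -w32 -D prints every line
# whose first 32 characters (the md5 hex digest) repeat.
mkdir -p pages
printf 'same\n' > pages/one.html
printf 'same\n' > pages/two.html
printf 'different\n' > pages/three.html
md5sum pages/*.html | sort | uniq -w32 -D
```

Only one.html and two.html appear in the output; three.html, being unique, is suppressed.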

Interesting, from a certain point of view, but aside from highlighting bloated leaking from Wikileaks, why should anyone care?

Good question!

Let’s compare the de-duped but raw document set with the de-duped and boiled document set.

In raw count, boiling took us from 2,131,135 words/tokens to 665,202 words/tokens.
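Word/token counts of this sort are easy to reproduce with wc; for instance, over a directory of text files (the demo files are my own):

```shell
# Total word count across all files in a directory,
# analogous to the raw-vs-boiled comparison above.
mkdir -p demo
printf 'alpha beta gamma\n' > demo/a.txt
printf 'delta epsilon\n' > demo/b.txt
cat demo/*.txt | wc -w   # prints 5
```

Note that wc’s notion of a “word” (whitespace-delimited) may differ slightly from TextSTAT’s tokenization, so expect the counts to be close but not identical.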

Here’s a question for busy reporters/researchers:

Has the CIA compromised the Tor network?

In the raw files, Tor occurs 22660 times.

In the boiled files, Tor occurs 4 times.

With TextSTAT, selecting an occurrence in the concordance window (a mouse click, for non-specialists) takes you to its context in the source file.

In a matter of seconds you can see that, as far as the HTML documents of Vault7 Part 1 show, the CIA is talking about top of rack (ToR), a switching architecture for networks, not the Tor Project.
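The same Tor-versus-ToR check can be run outside TextSTAT with grep, which is case-sensitive by default. A sketch on a made-up sample file:

```shell
# -o prints each match on its own line; -w matches whole words
# only. Case sensitivity means "ToR" and "tor" are not counted
# as "Tor".
printf 'Tor ToR tor Tor\n' > sample.txt
grep -o -w 'Tor' sample.txt | wc -l   # prints 2
```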

What other questions do you want to pose to the boiled Vault 7: CIA Hacking Tools Revealed document set?

Tooling up for efficient queries

First, you need: Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.

Second, grab a copy of: TextSTAT – Simple Text Analysis Tool runs on Windows, GNU/Linux and MacOS. (free)

When you first open TextSTAT, it will invite you to create a corpus.

The barrel icon to the far left creates a new corpus.

Once you save the new corpus, a reminder about file encodings pops up.

I haven’t explored loading Windows files while on a Linux box, but will, and report back. It is interesting to see the inclusion of PDF, something to explore after we choose which of the 124 possibly CIA PDF files to import.

Finally, you navigate to where you have stored the unzipped Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.

Select the first file, scroll to the end of the list, press shift and select the last file. Then choose OK. It takes a minute or so to load but it has a progress bar to let you know it is working.

Observations on TextSTAT

As far as I can tell, TextSTAT doesn’t use the traditional stop list of words, but enables you to set maximum and minimum occurrences in the Word Form window, along with wildcards. More flexible than the old stop list practice.

BTW, the context right/left on the Concordance window refers to characters, not words/tokens. Another departure from my experience with concordances. Not a criticism, just an observation of something that puzzled me at first.

Conclusion

The benefit of making secret information available, à la Wikileaks, cannot be overstated.

But making secret information available isn’t the same as making it accessible.

Investigators, reporters, researchers, the oft-mentioned public, all benefit from accessible information.

Next week look for a review of the probably CIA PDF files to see which ones I would incorporate into the corpora. (You may include more or less.)

PS: I’m looking for telecommuting work, editing, research (see this blog), patrick@durusau.net.

March 23, 2017

Boiling Reports of CIA Documents (Wikileaks CIA Vault 7 CIA Hacking Tools Revealed)

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 7:35 pm

Before you read today’s installment on the Wikileaks CIA Vault 7 CIA Hacking Tools Revealed, you should check out the latest drop from Wikileaks: CIA Vault 7 Dark Matter. Five documents and they all look interesting.

I started with a fresh copy of the HTML files in a completely new directory and ran Tidy first, plus fixed:

page_26345506.html:<declarations><string name="½ö"></string></declarations><p>›<br>

which I described in: Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1).

So with a clean and well-formed set of files, I modified the XQuery to collect all of the references to prior versions, reasoning that any file listed as a prior version can be ditched, leaving only the latest files.


for $doc in collection('collection.xml')//a[matches(.,'^\d+$')]
return ($doc/@href/string(), '&#10;')

Unlike the count function we used before, this returns the value of the href attribute and appends a newline after each one.

I saved that listing to priors.txt and then (your experience may vary on this next step):

xargs rm < priors.txt

WARNING: If your file names have spaces in them, you may delete files unintentionally. My data had no such spaces so this works in this case.
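If your file names might contain spaces, a safer sketch reads one name per line instead of letting xargs split on whitespace. (The demo below runs in a scratch directory with made-up file names.)

```shell
# Demo of a space-safe alternative to `xargs rm < priors.txt`.
demo=$(mktemp -d)
touch "$demo/a file.html" "$demo/keep.html"
printf '%s\n' "$demo/a file.html" > "$demo/priors.txt"

# Read one file name per line so embedded spaces survive intact.
while IFS= read -r f; do
  rm -- "$f"
done < "$demo/priors.txt"

ls "$demo"   # keep.html and priors.txt remain
```

The `--` guard also protects against file names that begin with a dash.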

Once I had the set of files without those representing “previous versions,” I’m down to the expected 1134.

That’s still a fair number of files and there is a lot of cruft in them.

For variety I did look at XSLT. To actually run, the empty template rules below need a stylesheet wrapper and an identity template that copies everything else through:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Identity template: copy every node not matched below. -->
<xsl:template match="@*|node()">
  <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

<!-- Empty templates: drop the Wikileaks chrome from the output. -->
<xsl:template match="style"/>
<xsl:template match="//div[@id = 'submit_wlkey']"/>
<xsl:template match="//div[@id = 'submit_help_contact']"/>
<xsl:template match="//div[@id = 'submit_help_tor']"/>
<xsl:template match="//div[@id = 'submit_help_tips']"/>
<xsl:template match="//div[@id = 'submit_help_after']"/>
<xsl:template match="//div[@id = 'submit']"/>
<xsl:template match="//div[@id = 'submit_help_buttons']"/>
<xsl:template match="//div[@id = 'efm-button']"/>
<xsl:template match="//div[@id = 'top-navigation']"/>
<xsl:template match="//div[@id = 'menu']"/>
<xsl:template match="//footer"/>
<xsl:template match="//script"/>
<xsl:template match="//li[@class = 'comment']"/>

</xsl:stylesheet>

Compare the XQuery query, on the command line no less:

for file in *.html; do
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -s:"$file" \
-qs:"/html/body//div[@id = 'uniquer']" -o:"$file.new"
done

(The backslash continues the java command across the line break, which is there strictly for formatting in this post.)

The files generated here will not be valid HTML.

Easy enough to fix with another round of Tidy.

After running Tidy, I was surprised to see a large number of very small files. Or at least I interpret 296 files of less than 1K in size to be very small files.

I created a list of them, linked back to the Wikileaks originals (296 Under 1K Wikileaks CIA Vault 7 Hacking Tools Revealed Files) so you can verify that I captured the content reported by Wikileaks. Oh, and here are the files I generated as well: Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.

In case you are interested, boiling the 1134 files took them from 38.6 MB to 8.8 MB of actual content for indexing, searching, concordances, etc.

Using the content only files, tomorrow I will illustrate how you can correlate information across files. Stay tuned!

March 22, 2017

Leak Publication: Sharing, Crediting, and Re-Using Leaks

Filed under: Data Provenance,Leaks,Open Access,Open Data — Patrick Durusau @ 4:53 pm

If you substitute “leak” for “data” in this essay by Daniella Lowenberg, does it work for leaks as well?

Data Publication: Sharing, Crediting, and Re-Using Research Data by Daniella Lowenberg.

From the post:

In the most basic terms- Data Publishing is the process of making research data publicly available for re-use. But even in this simple statement there are many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.

Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: 1 – data that are publicly accessible and are preserved for an indefinite amount of time, 2 – descriptive information about the data (metadata), and 3 – a citation for the data (giving credit to the data). Why are these elements essential? These three features make research data reusable and reproducible- the goal of a Data Publication.

As much as I admire the work of the International Consortium of Investigative Journalists (ICIJ), especially its Panama Papers project, sharing data beyond the confines of their community isn’t a value for them, much less a goal.

Like all secret keepers (government, industry, organizations), ICIJ has “reasons” for its secrecy, but none that I find any more or less convincing than those offered by other secret keepers.

Every secret keeper has an agenda their secrecy serves. Agendas that don’t include a public empowered to make judgments about their secret keeping.

The ICIJ proclaims Leak to Us.

A good place to leak, but include with your leak a demand, an unconditional demand, that your leak be released in its entirety within a year or two of its first publication.

Help enable the public to watch all secrets and secret keepers, not just those some secret keepers choose to expose.

CIA Documents or Reports of CIA Documents? Vault7

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 3:17 pm

As I tool up to analyze the 1134 non-duplicate/artifact HTML files in Vault 7: CIA Hacking Tools Revealed, it occurred to me those aren’t “CIA documents.”

Take Weeping Angel (Extending) Engineering Notes as an example.

Caveat: My range of experience with “CIA documents” is limited to those obtained by Michael Best and others using Freedom of Information Act requests. But that should be sufficient to identify “CIA documents.”

Some things I notice about Weeping Angel (Extending) Engineering Notes:

  1. A Wikileaks header with donation button.
  2. “Vault 7: CIA Hacking Tools Revealed”
  3. Wikileaks navigation
  4. reported text
  5. More Wikileaks navigation
  6. Ads for Wikileaks, Tor, Tails, Courage, bitcoin

I’m going to say that the 1134 non-duplicate/artifact HTML files in Vault7, Part1, are reports of portions (which portions is unknown) of some unknown number of CIA documents.

A distinction that influences searching, indexing, concordances, word frequency, just to name a few.

What I need is the reported text, minus:

  1. A Wikileaks header with donation button.
  2. “Vault 7: CIA Hacking Tools Revealed”
  3. Wikileaks navigation
  4. More Wikileaks navigation
  5. Ads for Wikileaks, Tor, Tails, Courage, bitcoin

Check in tomorrow when I boil 1134 reports of CIA documents to get something better suited for text analysis.

March 21, 2017

Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2 – The PDF Files)

Filed under: Leaks,Vault 7,Wikileaks — Patrick Durusau @ 4:21 pm

You may want to read Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1) before reading this post. In Part 1, I walk you through obtaining a copy of Wikileaks’ Vault 7: CIA Hacking Tools Revealed so you can follow and check my analysis and conclusions.

Fact checking applies to every source, including this blog.

I proofed my listing of the 357 PDF files in the first Vault 7 release and report an increase in arguably CIA files and a slight decline in public documents. An increase from 114 to 125 for the CIA and a decrease from 109 to 98 for public documents.

  1. Arguably CIA – 125
  2. Public – 98
  3. Wikileaks placeholders – 134

The listings to date:

  1. CIA (maybe)
  2. Public documents
  3. Wikileaks placeholders

For public documents, I created hyperlinks whenever possible. (Saying a fact and having evidence are two different things.) Vendor documentation that was not marked with a security classification I counted as public.

All I can say for the Wikileaks placeholders, some 134 of them, is to ignore them unless you like mining low grade ore.

I created notes in the CIA listing to help narrow your focus down to highly relevant materials.

I have more commentary in the works but I wanted to release these listings in case they help others make efficient use of their time.

Enjoy!

PS: A question I want to start addressing this week: how does the dilution of a leak impact its use?

March 20, 2017

Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1)

Filed under: CIA,Leaks,Vault 7,Wikileaks — Patrick Durusau @ 10:29 am

Executive Summary:

If you reported Vault 7: CIA Hacking Tools Revealed as containing:

8,761 documents and files from an isolated, high-security network situated inside the CIA’s Center for Cyber Intelligence in Langley, Virgina…. (Vault 7: CIA Hacking Tools Revealed)

you failed to check your facts.

I detail my process below but in terms of numbers:

  1. Of 7809 HTML files, 6675 are duplicates or Wikileaks artifacts
  2. Of 357 PDF files, 134 are Wikileaks artifacts (for materials not released). Of the remaining 223 PDF files, 109 are public information, the GNU Make Manual for instance. Out of the 357 PDF files, Wikileaks has delivered 114 arguably from the CIA, and some of those are doubtful. (Part 2, forthcoming)

Wikileaks haters will find little solace here. My criticisms of Wikileaks are for padding the leak and not enabling effective use of the leak. Padding the leak is obvious from the inclusion of numerous duplicate and irrelevant documents. Effective use of the leak is impaired by the padding but also by teases of what could have been released but wasn’t.

Getting Started

To start on common ground, fire up a torrent client, obtain and decompress: Wikileaks-Year-Zero-2017-v1.7z.torrent.

Decompression requires this password: SplinterItIntoAThousandPiecesAndScatterItIntoTheWinds

The root directory is year0.

When I run a recursive ls from above that directory:

ls -R year0 | wc -l

My system reports: 8820

Change to the year0 directory and ls reveals:

bootstrap/ css/ highlighter/ IMG/ localhost:6081@ static/ vault7/

Checking the files in vault7:

ls -R vault7/ | wc -l

returns: 8755

Change to the vault7 directory and ls shows:

cms/ files/ index.html logo.png

The directory files/ has only one file, org-chart.png. An organization chart of the CIA, but the sub-departments are listed with acronyms and “???.” Did the author of the chart not know the names of those departments? I point that out as the first of many file anomalies.

Some 7809 HTML files are found under cms/.

The cms/ directory has a sub-directory files, plus main.css and 7809 HTML files (including the index.html file).

Duplicated HTML Files

I discovered duplication of the HTML files quite by accident. I had prepared the files with Tidy for parsing with Saxon and compressed a set of those files for uploading.

The 7808 files I compressed started at 296.7 MB.

The compressed size, using 7z, was approximately 3.5 MB.

That’s almost two orders of magnitude of compression. 7z is good, but it’s not quite that good. 😉
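A quick sanity check on that ratio, using the sizes reported above:

```shell
# 296.7 MB of HTML compressed to ~3.5 MB: compute the integer ratio.
echo $((296700000 / 3488727))   # prints 85, i.e. nearly two orders of magnitude
```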

Checking my file compression numbers

You don’t have to take my word for the file compression experience. If you select all the page_*, space_* and user_* HTML files in a file browser, it should report a total size of 296.7 MB.

Create a sub-directory of year0/vault7/cms/, say mkdir htmlfiles, and then:

cp *.html htmlfiles

Then: cd htmlfiles

and,

7z a compressedhtml.7z *.html

Run: ls -l compressedhtml.7z

Result: 3488727 Mar 16 16:31 compressedhtml.7z

Tom Harris, in How File Compression Works, explains that:


Most types of computer files are fairly redundant — they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.

If you don’t agree the HTML files are highly repetitive, check the next section, where one source of duplication is demonstrated.

Demonstrating Duplication of HTML files

Let’s start with the same file as we look for a source of duplication. Load Cinnamon Cisco881 Testing at Wikileaks into your browser.

Scroll to near the bottom of the file, where you will find the list of prior versions.

Yes! There are 136 prior versions of this alleged CIA file in the directory.

Cinnamon Cisco881 Testing has the most prior versions, but all of the files have prior versions.

Are we now in agreement that duplicated versions of the HTML pages exist in the year0/vault7/cms/ directory?

Good!

Now we need to count how many duplicated files there are in year0/vault7/cms/.

Counting Prior Versions of the HTML Files

You may or may not have noticed, but every reference to a prior version takes the form:

<a href="filename.html">integer</a>

That’s going to be an important fact, but first let’s clean up the HTML so we can process it with XQuery/Saxon.

Preparing for XQuery

Before we start crunching the HTML files, let’s clean them up with Tidy.

Here’s my Tidy config file:

output-xml: yes
quote-nbsp: no
show-warnings: no
show-info: no
quiet: yes
write-back: yes

In htmlfiles I run:

tidy -config tidy.config *.html

Tidy reports two errors:


line 887 column 1 - Error: <declarations> is not recognized!
line 887 column 15 - Error: <string> is not recognized!

Grepping for “declarations>”:

grep "declarations" *.html

Returns:

page_26345506.html:<declarations><string name="½ö"></string></declarations><p>›<br>

The string element is present as well so we open up the file and repair it with XML comments:

<!-- <declarations><string name="½ö"></string></declarations><p>›<br> -->
<!-- prior line commented out to avoid Tidy error, pld 14 March 2017-->

Rerun Tidy:

tidy -config tidy.config *.html

Now Tidy returns no errors.

XQuery Finds Prior Versions

Our files are ready to be queried but 7809 is a lot of files.

There are a number of solutions but a simple one is to create an XML collection of the documents and run our XQuery statements across the files as a set.

Here’s how I created a collection file for these files:

I did an ls in the directory and redirected that to collection.xml. Opening the file, I deleted index.html, started each entry with <doc href=", ended each one with "/>, inserted <collection> before the first entry and </collection> after the last entry, and then saved the file.

Your version should look something like:

<collection>
  <doc href="page_10158081.html"/>
  <doc href="page_10158088.html"/>
  <doc href="page_10452995.html"/>
...
  <doc href="user_7995631.html"/>
  <doc href="user_8650754.html"/>
  <doc href="user_9535837.html"/>
</collection>
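The hand editing above can be scripted. Here is a sketch, demonstrated in a scratch directory with made-up file names (run the loop in year0/vault7/cms/ for the real thing):

```shell
# Build collection.xml from a directory listing, skipping index.html.
cd "$(mktemp -d)"
touch page_10158081.html user_9535837.html index.html

{
  echo '<collection>'
  for f in *.html; do
    [ "$f" = "index.html" ] && continue
    echo "  <doc href=\"$f\"/>"
  done
  echo '</collection>'
} > collection.xml

cat collection.xml
```

Because collection.xml doesn’t end in .html, the glob never picks up the file being written.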

The prior versions in Cinnamon Cisco881 Testing from Wikileaks, have this appearance in HTML source:

<h3>Previous versions:</h3>
<p>| <a href="page_17760540.html">1</a> <span class="pg-tag"><i>empty</i></span>
| <a href="page_17760578.html">2</a> <span class="pg-tag"></span>

…..

| <a href="page_23134323.html">135</a> <span class="pg-tag">[Xetron]</span>
| <a href="page_23134377.html">136</a> <span class="pg-tag">[Xetron]</span>
|</p>
</div>

You will need to spend some time with the files (as I have) to satisfy yourself that <a> elements containing only numbers are used exclusively for prior-version references. If you come across any counter-examples, I would be very interested to hear about them.

To get a file count on all the prior references, I used:

let $count := count(collection('collection.xml')//a[matches(.,'^\d+$')])
return $count

Run that script to find: 6514 previous editions of the base files
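If you want a cross-check outside of XQuery, a rough grep over the cleaned files counts the same digits-only links. A sketch against a made-up sample file (the href values echo the Cinnamon Cisco881 listing shown above):

```shell
# Rough cross-check: count <a> links whose text is digits only.
cd "$(mktemp -d)"
cat > sample.html <<'EOF'
<p>| <a href="page_17760540.html">1</a>
| <a href="page_17760578.html">2</a>
| <a href="page_23134323.html">135</a>
<a href="files/doc.pdf">the PDF</a></p>
EOF
grep -oE '<a href="[^"]*">[0-9]+</a>' sample.html | wc -l   # counts 3 links
```

Run the same pipeline over *.html in the real directory; -o counts every match even when several links share a line.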

Unpacking the XQuery

Rest assured that’s not how I wrote the first XQuery on this data set! 😉

Without exploring all the by-ways and alleys I traversed, I will unpack that query.

First, the goal of the query is to identify every <a> element that contains only digits, recalling that previous-version links have digits only in their <a> elements.

A shout out to Jonathan Robie, Editor of XQuery, for reminding me that string expressions match substrings unless they have beginning and ending anchors. Here:

'^\d+$'

The \d matches only digits, the + enables matching 1 or more digits, and the beginning ^ and ending $ eliminate any <a> elements that might start with one or more digits but also contain text, like links to files, etc.

Expanding out a bit more, [matches(.,'^\d+$')], the [ ] enclose a predicate that consists of the matches function, which takes two arguments. The . here represents the content of an <a> element, followed by a comma as a separator and then the regex that provides the pattern to match against.

Although talked about as a “code smell,” the //a in //a[matches(.,'^\d+$')] enables us to pick up the <a> elements wherever they are located. We did have to repair these HTML files and I don’t want to spend time debugging ephemeral HTML.
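You can see what the anchors buy you with a quick grep sketch (same pattern, POSIX digit class instead of \d; the sample lines are hypothetical):

```shell
# Only lines that are digits from start (^) to end ($) survive.
printf '136\npage_23134377.html\n2nd draft\n42\n' | grep -E '^[0-9]+$'
# prints:
# 136
# 42
```

Without the anchors, “page_23134377.html” and “2nd draft” would match too, since both contain digits somewhere.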

Almost there! The collection file, along with the collection function, collection('collection.xml') enables us to apply the XQuery to all the files listed in the collection file.

Finally, we surround all of the foregoing with the count function: count(collection('collection.xml')//a[matches(.,'^\d+$')]) and declare a variable to capture the result of the count function: let $count :=

So far so good? I know, tedious for XQuery jocks but not all news reporters are XQuery jocks, at least not yet!

Then we produce the results: return $count.

But 6514 files aren’t 6675 files, you said 6675 files

Yes, you’re right! Thanks for paying attention!

I said at the top, 6675 are duplicates or Wikileaks artifacts.

Where are the others?

If you look at User #71477, which has the file name, user_40828931.html, you will find it’s not a CIA document but part of Wikileaks administration for these documents. There are 90 such pages.

If you look at Marble Framework, which has the file name, space_15204359.html, you find it’s a CIA document but a form of indexing created by Wikileaks. There are 70 such pages.

Don’t forget the index.html page.

When added together, 6514 duplicates, 90 user pages, 70 space pages, and the index.html page, I get 6675 duplicates or Wikileaks artifacts.
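The arithmetic is easy to verify:

```shell
# duplicates + user pages + space pages + index.html
echo $((6514 + 90 + 70 + 1))   # prints 6675
```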

What’s your total?


Tomorrow:

In Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2), I look under year0/vault7/cms/files to discover:

  1. Arguably CIA files (maybe) – 114
  2. Public documents – 109
  3. Wikileaks artifacts – 134

I say “Arguably CIA” because there are file artifacts and anomalies that warrant your attention in evaluating those files.
