Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 31, 2017

Wikileaks Marble – 676 Source Code Files – Would You Believe 295 Unique (Maybe)

Filed under: CIA,Cybersecurity,Wikileaks — Patrick Durusau @ 7:38 pm

Wikileaks released Marble Framework, described as:

Today, March 31st 2017, WikiLeaks releases Vault 7 “Marble” — 676 source code files for the CIA’s secret anti-forensic Marble Framework. Marble is used to hamper forensic investigators and anti-virus companies from attributing viruses, trojans and hacking attacks to the CIA.

Effective leaking doesn’t seem to have recommended itself to Wikileaks.

Marble-Framework-ls-lRS-devworks.txt is an ls -lRS listing of the devworks directory.

After looking for duplicate files and starting this post, I discovered entirely duplicated directories:

Compare:

devutils/marbletester/props with devutils/marble/props.

devutils/marbletester/props/internal with devutils/marble/props/internal

devutils/marbleextensionbuilds/Marble/Deobfuscators with devutils/marble/Shared/Deobfuscators

That totals to 182 entirely duplicated files.
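
If you have the Marble release unpacked locally, you can confirm the directory-level duplication with diff in recursive, brief mode; a minimal sketch, assuming you run it from the directory above devutils (no output for a pair means the trees are identical):

diff -rq devutils/marbletester/props devutils/marble/props
diff -rq devutils/marbleextensionbuilds/Marble/Deobfuscators devutils/marble/Shared/Deobfuscators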

In Marble-Framework-ls-lRS-devworks-annotated.txt I separated files on the basis of file size. Groups of duplicate files are separated from other files with a blank line and headed by the number of duplicate copies.

I marked only exact file size matches as duplicates, even though files close in size could differ only by insignificant whitespace.
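
The same exact-size grouping can be sketched from the command line, assuming the listing is in Marble-Framework-ls-lRS-devworks.txt (field 5 of ls -l output is the file size; lines with fewer fields are directory headers and totals):

awk 'NF >= 9 { count[$5]++ } END { for (s in count) if (count[s] > 1) print count[s], s }' \
Marble-Framework-ls-lRS-devworks.txt | sort -rn

For every file size that occurs more than once, this prints how many files share it, which is the same exact-match test described above.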

After removing the entirely duplicated directories, there remain 199 duplicate files.

The 182 files in entirely duplicated directories plus the 199 remaining duplicates bring us to a grand total of 381 duplicate files.

Or the quicker way to say it: Vault 7 Marble — 295 unique source code files for the CIA’s secret anti-forensic Marble Framework.

Wikileaks may be leaking the material just as it was received. But that’s very poor use of your time and resources.

Leak publishers should polish leaks until they have a fire-hardened point.

March 29, 2017

4 Billion “Records” Leaked In 2016 – How Do You Define Record?

Filed under: Cybersecurity,Journalism,News,Reporting,Security — Patrick Durusau @ 3:33 pm

The IBM X-Force Threat Intelligence Index 2017 report leaves the impression hackers are cutting through security like a hot knife through butter:

With Internet-shattering distributed-denial-of-service (DDoS) attacks, troves of records leaked through data breaches, and a renewed focus by organized cybercrime on business targets, 2016 was a defining year for security. Indeed, in 2016 more than 4 billion records were leaked, more than the combined total from the two previous years, redefining the meaning of the term “mega breach.” In one case, a single source leaked more than 1.5 billion records.1 (page 3)

The report helpfully defines terms at page 3 and in the glossary (page 29) but never defines “record.”

The 4 billion records “fact” will appear in security blogs, Twitter, business zines, mainstream media, all without asking: “What is a record?”

Here are some things that could be records:

  • account, username, password
  • medical record (1 or more pages)
  • financial record (1 or more pages)
  • CIA document (1 or more pages)
  • Tax records (1 or more pages)
  • Offshore bank data (spreadsheet, 1 or more pages)
  • Presentations (PPT, 1 or more pages)
  • Accounting records (1 or more pages)
  • Emails (1 or more pages)
  • Photos, nude or otherwise

IBM’s “…4 billion records were leaked…” is a marketing statement for IBM security services, not a statement of fact.

Don’t make your readers dumber by repeating IBM marketing slogans without critical comments.

PS: I haven’t checked the other “facts” claimed in this document. The failure to define “record” was enough to discourage further reading.

What’s Up With Data Padding? (Regulations.gov)

Filed under: Data Quality,Fair Use,Government Data,Intellectual Property (IP),Transparency — Patrick Durusau @ 10:41 am

I forgot to mention in Copyright Troll Hunting – 92,398 Possibles -> 146 Possibles that while using LibreOffice, I deleted a large number of columns that were either N/A only or not relevant for troll-mining.zip. (A command-line sketch for the same trimming follows the list below.)

After removal of the “no last name” records, these fields had N/A for all records, except as noted:

  1. L – Implementation Date
  2. M – Effective Date
  3. N – Related RINs
  4. O – Document SubType (Comment(s))
  5. P – Subject
  6. Q – Abstract
  7. R – Status – (Posted, except for 2)
  8. S – Source Citation
  9. T – OMB Approval Number
  10. U – FR Citation
  11. V – Federal Register Number (8 exceptions)
  12. W – Start End Page (8 exceptions)
  13. X – Special Instructions
  14. Y – Legacy ID
  15. Z – Post Mark Date
  16. AA – File Type (1 docx)
  17. AB – Number of Pages
  18. AC – Paper Width
  19. AD – Paper Length
  20. AE – Exhibit Type
  21. AF – Exhibit Location
  22. AG – Document Field_1
  23. AH – Document Field_2

Regulations.gov, not the Copyright Office, is responsible for the collection and management of comments, including the bulked up export of comments.

From the state of the records, one suspects the “bulking up” is NOT an artifact of the export but represents the storage of each record.

One way to test that theory would be a query on the noise fields via the API for Regulations.gov.

The documentation for the API is outdated; the Field References documentation lacks the Document Detail field (column AI), which contains the URL to access the comment.

The closest thing I could find was:

fileFormats Formats of the document, included as URLs to download from the API

How easy/hard it will be to download attachments isn’t clear.

BTW, the comment pages themselves are seriously puffed up. Take https://www.regulations.gov/document?D=COLC-2015-0013-52236.

Saved to disk: 148.6 KB.

Content of the comment: 2.5 KB.

The content of the comment is 1.6% of the delivered webpage.
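
Reproducing the ratio is simple, assuming you have saved the delivered page and the extracted comment text to separate files (the file names here are illustrative):

page=$(wc -c < comment-page.html)   # bytes in the delivered page
text=$(wc -c < comment-text.txt)    # bytes of actual comment content
echo "scale=1; 100 * $text / $page" | bc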

It must have taken serious effort to achieve a 98.4% noise to 1.6% signal ratio.

How transparent is data when you have to mine for the 1.6% that is actual content?

March 28, 2017

How Not To Lose A Community’s Trust

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:09 pm

Humbled Malware Author Leaks His Own Source Code to Regain Community’s Trust by Catalin Cimpanu.

From the post:

The author of the Nuclear Bot banking trojan has leaked the source code of his own malware in a desperate attempt to regain trust and credibility in underground cybercrime forums.

Nuclear Bot, also known as NukeBot and more recently as Micro Banking Trojan and TinyNuke, is a new banking trojan that appeared on the malware scene in December 2016, when its author, a malware coder known as Gosya, started advertising it on an underground malware forum.

According to Gosya's ad, this new banking trojan was available for rent and included several features, such as:

  • Formgrabber and Web-Injection modules (Firefox, Chrome, IE, and Opera)
  • A SOCKS proxy module
  • Remote EXE file launcher module
  • Hidden VNC module that worked on Windows versions between XP and 10
  • Rootkit for 32-bit and 64-bit architectures
  • UAC bypass
  • Windows Firewall bypass
  • IBM Trusteer firewall bypass
  • Bot-killer – a mini anti-virus meant to remove all competing malware from the infected machine

Subsequent analysis from both Arbor Networks and Sixgill confirmed the trojan's deadly features. In spite of these favorable reports, Gosya's Nuclear Bot saw little adoption among cybercrime gangs, as the malware's author miserably failed to gain their trust.

See Catalin’s post for the most impressive list of social fails I have seen in years. Seriously.

More importantly, for hacker and other forums, learn the local customs. Always.

Enjoy!

Copyright Troll Hunting – 92,398 Possibles -> 146 Possibles

Filed under: Fair Use,Intellectual Property (IP) — Patrick Durusau @ 6:51 pm

When hunting copyright trolls, well, trolls of any kind, the smaller the number to be hunted the better.

The Copyright Office is conducting the Section 512 Study, which it describes as:

The United States Copyright Office is undertaking a public study to evaluate the impact and effectiveness of the safe harbor provisions contained in section 512 of title 17, United States Code.

Enacted in 1998 as part of the Digital Millennium Copyright Act (“DMCA”), section 512 established a system for copyright owners and online entities to address online infringement, including limitations on liability for compliant service providers to help foster the growth of internet-based services. Congress intended for copyright owners and internet service providers to cooperate to detect and address copyright infringements. To qualify for protection from infringement liability, a service provider must fulfill certain requirements, generally consisting of implementing measures to expeditiously address online copyright infringement.

While Congress understood that it would be essential to address online infringement as the internet continued to grow, it may have been difficult to anticipate the online world as we now know it, where each day users upload hundreds of millions of photos, videos and other items, and service providers receive over a million notices of alleged infringement. The growth of the internet has highlighted issues concerning section 512 that appear ripe for study. Accordingly, as recommended by the Register of Copyrights, Maria A. Pallante, in testimony and requested by Ranking Member Conyers at an April 2015 House Judiciary Committee hearing, the Office is initiating a study to evaluate the impact and effectiveness of section 512 and has issued a Notice of Inquiry requesting public comment. Among other issues, the Office will consider the costs and burdens of the notice-and-takedown process on large- and small-scale copyright owners, online service providers, and the general public. The Office will also review how successfully section 512 addresses online infringement and protects against improper takedown notices.

The Office received over 92,000 written submissions by the April 1, 2016 deadline for the first round of public comments. The Office then held public roundtables on May 2nd and 3rd in New York and May 12th and 13th in San Francisco to seek further input on the section 512 study. Transcripts of the New York and San Francisco roundtables are now available online. Additional written public comments are due by 11:59 pm EST on February 21, 2017 and written submissions of empirical research are due by 11:59 pm EST on March 22, 2017.

You can see the comments at: Requests for Public Comments: Digital Millennium Copyright Act Safe Harbor Provisions, all 92,398 of them.

You can even export them to a CSV file, which runs a little over 33.5 MB in size.

It is likely that the same copyright trolls who provoked this review with non-public comments to the Copyright Office and others posted comments, but how to find them in a sea of 92,398 comments?

Some simplifying assumptions:

No self-respecting copyright troll will use the public comment template.

grep -v "Template Form Comment" DOCKET_COLC-2015-0013.csv | wc -l

Using grep with -v means it does NOT return matching lines. That is, only lines without “Template Form Comment” will be returned.

We modify that to read:

grep -v "Template Form Comment" DOCKET_COLC-2015-0013.csv > no-form.csv

The > redirects the lines without “Template Form Comment” into the file no-form.csv.

Next, scanning the file we notice, “no last name/No Last Name.”

grep -iv "no last name" no-form.csv | wc -l

The -i option makes the search for “no last name” case-insensitive in no-form.csv, and the -v option again returns only the lines without a match.

The count without “no last name:” 3359.
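
For what it's worth, both passes can be combined into a single pipeline, skipping the intermediate no-form.csv (the output file name here is mine):

grep -v "Template Form Comment" DOCKET_COLC-2015-0013.csv \
| grep -iv "no last name" > candidates.csv
wc -l candidates.csv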

A lot better than 92,398 but not really good enough.

We're nearing hand-editing territory, so I resorted to LibreOffice at this point.

Sort on column D, organization (of columns A to AI). If you scroll down, row 123 has N/A for organization. The entry just prior to it is “musicnotes.” What? Where did Sony, etc., go?

Ah, LibreOffice sorted organizations and counted “N/A” as an organization’s name.

Let’s see, from row 123 to row 3293, inclusive.

Well, deleting those rows leaves us with: 183 rows.

I continued by deleting comments by anonymous, individuals, etc., and my final total is 146 rows.

Check out troll-mining.zip!

Not all copyright trolls, mind you; I need to remove the Internet Archive, EFF and other people on the right side of the section 512 issue.

Who else should I remove?

A few reasons for a clean copyright troll list:

First, it leads to FOIA requests about other communications to the Copyright Office by the trolls in the list. Can’t ask if you don’t have a troll list.

Second, it provides locations for protests and other ways to call unwanted attention to these trolls.

Third, well, you know, unfortunate things happen to trolls. It’s a natural consequence of a life predicated upon harming others.

March 27, 2017

How Do You Spell Media Bias? M-U-S-L-I-M

Filed under: Bias,Journalism,News,Reporting — Patrick Durusau @ 4:11 pm

Disclosure: I have contempt for news reports that hype acts of terrorism. Even more so when little more than criminal acts by Muslims are bemoaned as existential threats to Western society. Just so you know I’m not in a position to offer a balanced view of Ronald Bailey’s post.

Do Muslims Commit Most U.S. Terrorist Attacks?: Nope. Not even close. by Ronald Bailey.

From the post:

“It’s gotten to a point where it’s not even being reported. In many cases, the very, very dishonest press doesn’t want to report it,” asserted President Donald Trump a month ago. He was referring to a purported media reticence to report on terror attacks in Europe. “They have their reasons, and you understand that,” he added. The implication, I think, is that the politically correct press is concealing terrorists’ backgrounds.

To bolster the president’s claims, the White House then released a list of 78 terror attacks from around the globe that Trump’s minions think were underreported. All of the attackers on the list were Muslim—and all of the attacks had been reported by multiple news outlets.

Some researchers at Georgia State University have an alternate idea: Perhaps the media are overreporting some of the attacks. Political scientist Erin Kearns and her colleagues raise that possibility in a preliminary working paper called “Why Do Some Terrorist Attacks Receive More Media Attention Than Others?”

For those five years, the researchers found, Muslims carried out only 11 out of the 89 attacks, yet those attacks received 44 percent of the media coverage. (Meanwhile, 18 attacks actually targeted Muslims in America.) The Boston marathon bombing generated 474 news reports, amounting to 20 percent of the media terrorism coverage during the period analyzed. Overall, the authors report, “The average attack with a Muslim perpetrator is covered in 90.8 articles. Attacks with a Muslim, foreign-born perpetrator are covered in 192.8 articles on average. Compare this with other attacks, which received an average of 18.1 articles.”

While the authors rightly question the unequal reporting of terrorist attacks, which falsely creates a link between Muslims and terrorism in the United States, I question the appropriateness of a media focus on terrorism at all.

Aside from the obvious lure that fear sells and fear of Muslims sells very well in the United States, the human cost from domestic terrorist attacks, not just those by Muslims, hardly justifies crime blotter coverage.

Consider that in 2014, there were 33,559 deaths due to gun violence and 32 from terrorism.

But as I said, fear sells and fear of Muslims sells very well.

Terrorism or more properly the fear of terrorism has been exploited to distort government priorities and to reduce the rights of all citizens. Media participation/exploitation of that fear is a matter of record.

The question now is whether the media will knowingly continue its documented bigotry or choose another course?

The paper:

Kearns, Erin M. and Betus, Allison and Lemieux, Anthony, Why Do Some Terrorist Attacks Receive More Media Attention Than Others? (March 5, 2017). Available at SSRN: https://ssrn.com/abstract=2928138

Hacking vs. Buying Passwords – Which One For You?

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 3:04 pm

You remember the Dilbert cartoon on corporate security where the pointy-haired boss asks what Dilbert would do if a stranger offered to buy company secrets. Dilbert responds by asking how much the stranger is offering. See the strip for the boss’ answer and Wally’s follow-up question.

Danny Palmer reports the price point at which employees will sell their access, and it may be lower than you think.

From the post:

A cyberattack could cost an organisation millions, but an employee within your company might be willing to give an outsider access to sensitive information via their login credentials for under £200.

According to a report examining insider threats by Forcepoint, 14 percent of European employees claimed they would sell their work login credentials to an outsider for £200. And the researchers found that, of those who’d sell their credentials to an outsider, nearly half would do it for less.

That’s about $260.00 U.S. at today’s exchange rates.

Only you know your time and expense of hacking passwords and/or buying them on the dark web.

I suspect the price point is even lower in government agencies with unpopular leadership.

I haven’t seen any surveys of US employees, but I suspect employees of companies, suppliers, contractors, banks, etc., involved in oil pipeline construction are equally open to selling passwords. Given labor conditions in the US, perhaps even more so.

Not that anyone opposing a multi-generational environmental crime like an oil pipeline would commit a crime when there are so many lawful and completely ineffectual means to oppose it at hand.

PS: As recent CIA revelations demonstrate, the question isn’t if government will betray the public’s interest but when. The same is true for environmental, health and other concerns.

Peeping Toms Jump > 16,000 In UK

Filed under: Government,Privacy,Security — Patrick Durusau @ 8:23 am

The ranks of peeping toms swell by at least 16,000 in the UK:

More than 16,000 staff in the public sector empowered to examine your web browsing by Graeme Burton.

From the post:

More than 16,000 staff in the public sector and its agencies have been empowered by Section 4 of the Investigatory Powers Act to snoop on people’s internet connection records.

And that’s before the estimated 4,000 staff at security agency MI5, the 5,500 at GCHQ and 2,500 at MI6 are taken into account.

That’s according to the responses from a series of almost 100 Freedom of Information (FOI) requests made in a bid to find out exactly who has the power to snoop on ordinary people’s web browsing histories under the Act.

GCHQ, the Home Office, MI6, the National Crime Agency, the Ministry of Justice, all three armed forces and Police Service of Scotland all failed to respond to the FOI requests – so the total could be much higher.

My delusion that the UK has a mostly rational government was shattered by passage of the Investigatory Powers Act. Following web browsing activity, hell, even tracking everyone and their conversations, 24 x 7, isn’t going to stop random acts of violence.

What part of random acts of violence being exactly that, random, seems to be unclear? Are there no UK academics to take up the task of proving prediction of random events is possible?

Unless and until the UK Parliament comes to its senses, the best option for avoiding UK peeping toms is to move to another country.

If re-location isn’t possible, use a VPN and a Tor browser for all web activity.

March 26, 2017

Transparency can have a prophylactic effect

Filed under: Journalism,News,Reporting,Transparency — Patrick Durusau @ 4:57 pm

Farai Chideya set out to explore:

…who reported the 2016 election, and whether political teams’ race and gender diversity had any impact on newsrooms.

That’s an important question and Chideya certainly has the qualifications and support ( fellow at Harvard’s Shorenstein Center on Media, Politics and Public Policy) to pursue it.

One problem. For reasons best known to themselves, numerous media organizations refuse to provide diversity data. (full stop)


But the most important data point for this project—numbers from newsrooms on their 2016 political team staffing—has been the hardest to collect because very few managers or business-side staff are willing to disclose their data. One company admitted off the record that they were not responding to diversity requests, period. The Wall Street Journal provided the statement that it “declined to provide specific personnel information.” An organization sent numbers for its corporate parent company, whose size is approximately a thousand times the size of the entire news team, let alone the political team. Another news manager promised verbally to cooperate with the inquiry, but upon repeated follow up completely ghosted.

Concealment wasn’t the uniform response, as Chideya makes clear, but useful responses were few and far between. Enough so to provoke her post.

She captures my sentiments writing:


If we journalists can’t turn as unsparing a gaze on ourselves as we do on others, it speaks poorly for us and the credibility of our profession. If the press lauds itself for demanding transparency from government but cannot achieve transparency in its newsrooms, that is cowardice. If we say we can cover all of America with representatives of only a few types of communities, we may win battles but lose the war to keep news relevant to a broad segment of Americans. This is as strong a business argument as a moral argument.

If you need additional motivation, be aware that Chideya is proceeding in the face of non-cooperation and when her study is published, there will be a list of who has been naughty and nice.

Here’s how to self-report:


Whether or not you are a news organization I’ve already contacted, please email me at Farai_Chideya@hks.harvard.edu

For the purposes of the reporting, I’m looking for a race/gender count of 2016-cycle political staffers—full-time or at least 25-hour-per-week contract workers (but not freelancers paid by the story). People come and go during the election season, but these should be people who spent at least six months covering the election between September 2015 and November 2016.

If you want to add to the data you disclose, you can include separate counts for freelancers; or for staff who worked on politics less than six months of the cycle, but those should be broken out separately.

Want bonus points? Produce an org chart showing how your staff diversity played out across the ranks of reporters and editors. Feel free to annotate for self-reported class background or other metrics if you want, too. But race and gender are the minimum.

We’d like on-the-record numbers and interviews from people who we can use as sources in the report: managers, corporate communications staff, anyone authorized to speak on behalf of the newsroom. Please indicate if you are speaking on the record and in what role.

Because we are not getting this information, in many cases, we also welcome interviews and information on background. That is, if you are a staffer and can provide information, please do, and tell us who you are and that you don’t want to be quoted or cited. We’ll take what you provide to us into account as we do our research, but obviously it can’t be the final word. You could also offer quotes about the topic on the record, and your assessment of staff diversity on background.

As we conclude the report, we will release information on who has provided information, and who it was requested from who did not.

Self-reporting beats being on the naughty list and/or your diversity information extracted by a ham-handed hacker who damages your systems as well.

Who knew? Transparency can have a prophylactic effect.

See Chideya’s full post at: One question that turns courageous journalists into cowards

March 25th – Anniversary Of Triangle Fire – The Names Map

Filed under: Government,Maps,Politics — Patrick Durusau @ 11:09 am

The Names Map

From the website:

The Names Map displays the name, home address, likely age, country of origin, and final resting place of all known Triangle Fire victims.

(map and list of 146 victims)

The Remember the Triangle Fire Coalition connects individuals and organizations with the 1911 Triangle Factory Fire — one of the pivotal events in US history and a turning point in labor’s struggle to achieve fair wages, dignity at work and safe working conditions. Outrage at the deaths of 146 mostly young, female immigrants inspired the union movement and helped to institute worker protections and fire safety laws. Today, basic rights and benefits in the workplace are not a guarantee in the United States or across the world. We believe it is more vital than ever that these issues are defended.

The “not guilty” verdict on all counts of manslaughter for Triangle Factory owners Max Blanck and Isaac Harris is often overlooked in anniversary celebrations. (Image from Cornell University, ILR School, Kheel Center’s Remembering The 1911 Triangle Factory Fire, Transcript of Criminal Trial)

That verdict is a forerunner to the present day decisions to not prosecute police shootings/abuse of unarmed civilians.

Celebrate the progress made since the 1911 Triangle Factory Fire while remaining mindful that exploitation and abuse continue to this very day.

The Remember the Triangle Fire Coalition has assembled a large number of resources, many of which are collections of other resources, including primary materials.

Politics For Your Twitter Feed

Filed under: Government,Politics,Tweets,Twitter — Patrick Durusau @ 8:28 am

Hungry for more political tweets?

GovTrack created the Members of Congress Twitter list.

Barometer of congressional mood?

Enjoy!

March 25, 2017

Your maps are not lying to you

Filed under: Mapping,Maps,Topic Maps — Patrick Durusau @ 8:34 pm

Your maps are not lying to you by Andy Woodruff.

From the post:

Or, your maps are lying to you but so would any other map.

A week or two ago [edit: by now, sometime last year] a journalist must have discovered thetruesize.com, a nifty site that lets you explore and discover how sizes of countries are distorted in the most common world map, and thus was born another wave of #content in the sea of web media.

Your maps are lying to you! They are WRONG! Everything you learned is wrong! They are instruments of imperial oppressors! All because of the “monstrosity” of a map projection, the Mercator projection.

Technically, all of that is more or less true. I love it when little nuggets of cartographic education make it into popular media, and this is no exception. However, those articles spend most of their time damning the Mercator projection, and relatively little on the larger point:

There are precisely zero ways to draw an accurate map on paper or a screen. Not a single one.

In any bizarro world where a different map is the standard, the internet is still abuzz with such articles. The only alternatives to that no-good, lying map of yours are other no-good, lying maps.

Andy does a great job of covering the reasons why maps (in the geographic sense) are less than perfect for technical (projection) as well as practical (abstraction, selection) reasons. He also offers advice on how to critically evaluate a map for “bias.” Or at least possibly discovering some of its biases.

For maps of all types, including topic maps, the better question is:

Does the map represent the viewpoint you were paid to represent?

If yes, it’s a great map. If no, your client will be unhappy.

Critics of maps, whether they admit it or not, are arguing for the map as they would have created it. That should be on their dime and not yours.

Looking For Installed Cisco Routers?

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 7:50 pm

News that 300 models of Cisco Catalyst switches are vulnerable to a simple Telnet attack (see, for example, Cisco issues critical warning after CIA WikiLeaks dump bares IOS security weakness by Michael Cooney) has piqued interest in installed Cisco routers.

You already know that Nmap can uncover and identify routers.
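
A minimal Nmap sketch along those lines, assuming you are authorized to scan the range in question (the address range here is a placeholder):

# probe TCP port 23 (Telnet), report only hosts with the port open, with service/version detection
nmap -sV -p 23 --open 192.168.1.0/24

Service detection (-sV) is usually what flags Cisco IOS devices in the output.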

What you may not know is that government hemorrhaging of IT information can be a useful supplement to Nmap.

Consider GovernmentBids.com for example.

You can search by federal government bid types and/or one or more of the fifty states, for bids up to 999 days prior to the current date; results include the bids as well as the winning vendor.

If you are routinely searching for IT vulnerability information, I would not begrudge them the $131/month fee for full information on bids.

From a topic map perspective, pairing IT bid information with vulnerability reports would be creative and valuable intelligence.

How much IT information is your office/department hemorrhaging?

March 24, 2017

Attn: Zero-Day Hunters, ATMs Running Windows XP Have Cash

Filed under: Cybersecurity,Security — Patrick Durusau @ 8:19 pm

Kimberly Crawley reprises her Do ATMs running Windows XP pose a security risk? You can bank on it! as a reminder that bank ATMs continue to run Windows XP.

Her post was three years old in February 2017 and just as relevant as on the first day of its publication.

Rather than passing even more unenforceable hacking legislation, states and Congress should impose treble damages with mandatory attorney’s fees on commercial victims of hacking attacks.

Insecurity will become a cost center in their budgets, justifying realistic spending and demand for more secure software.

In the meantime, remember ATMs running Windows XP dispense cash.

Other Methods for Boiling Vault 7: CIA Hacking Tools Revealed?

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 2:58 pm

You may have other methods for boiling content out of the Wikileaks Vault 7: CIA Hacking Tools Revealed.

To that end, here is the list of deduped files.

Warning: The Wikileaks pages I have encountered are so malformed that repair will always be necessary before using XQuery.

Enjoy!

Efficient Querying of Vault 7: CIA Hacking Tools Revealed

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 1:42 pm

This week we have covered:

  1. Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1) Eliminated duplication and editorial artifacts, 1134 HTML files out of 7809 remain.
  2. Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2 – The PDF Files) Eliminated public and placeholder documents, 114 arguably CIA files remain.
  3. CIA Documents or Reports of CIA Documents? Vault7 All of the HTML files are reports of possibly CIA material but we do know HTML file != CIA document.
  4. Boiling Reports of CIA Documents (Wikileaks CIA Vault 7 CIA Hacking Tools Revealed) The HTML files contain a large amount of cruft, which can be extracted using XQuery and common tools.

Interesting, from a certain point of view, but aside from highlighting bloated leaking from Wikileaks, why should anyone care?

Good question!

Let’s compare the de-duped but raw with the de-duped but boiled document set.

De-duped but raw document set:

De-duped and boiled document set:

In raw count, boiling took us from 2,131,135 words/tokens to 665,202 words/tokens.
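
A rough cross-check of those counts, assuming the raw and boiled files sit in directories named raw/ and boiled/ (wc's idea of a word won't match TextSTAT's tokenizer exactly):

cat raw/*.html | wc -w
cat boiled/*.html | wc -w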

Here’s a question for busy reporters/researchers:

Has the CIA compromised the Tor network?

In the raw files, Tor occurs 22660 times.

In the boiled files, Tor occurs 4 times.

Here’s a screen shot of the occurrences:

With TextSTAT, select the occurrence in the concordance and another select (a mouse click, to non-specialists) takes you to:

In a matter of seconds, you can answer: as far as the HTML documents of Vault7 Part 1 show, the CIA is talking about top of rack (ToR), a switching architecture for networks, not the Tor Project.
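
If you prefer the command line, a rough grep cross-check looks like this; note that the case-insensitive form also catches ToR and tor, which is exactly the distinction that matters here (directory name is mine):

grep -ow 'Tor' boiled/*.html | wc -l    # case-sensitive, whole word only
grep -iow 'tor' boiled/*.html | wc -l   # case-insensitive, ToR and tor included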

What other questions do you want to pose to the boiled Vault 7: CIA Hacking Tools Revealed document set?

Tooling up for efficient queries

First, you need: Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.

Second, grab a copy of: TextSTAT – Simple Text Analysis Tool, which runs on Windows, GNU/Linux and MacOS. (free)

When you first open TextSTAT, it will invite you to create a corpus.

The barrel icon to the far left creates a new corpus. Select it and:

Once you save the new corpus, this reminder about encodings pops up:

I haven’t explored loading Windows files while on a Linux box but will and report back. Interesting to see inclusion of PDF. Something we need to explore after we choose which of the 124 possibly CIA PDF files to import.

Finally, you are at the point of navigating to where you have stored the unzipped Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files:

Select the first file, scroll to the end of the list, press shift and select the last file. Then choose OK. It takes a minute or so to load but it has a progress bar to let you know it is working.

Observations on TextSTAT

As far as I can tell, TextSTAT doesn’t use the traditional stop list of words but enables you to set maximum and minimum occurrences in the Word Form window, along with wildcards. More flexible than the old stop list practice.

BTW, the context right/left on the Concordance window refers to characters, not words/tokens. Another departure from my experience with concordances. Not a criticism, just an observation of something that puzzled me at first.

Conclusion

The benefit of making secret information available, a la Wikileaks, cannot be overstated.

But making secret information available isn’t the same as making it accessible.

Investigators, reporters, researchers, the oft-mentioned public, all benefit from accessible information.

Next week look for a review of the probably CIA PDF files to see which ones I would incorporate into the corpora. (You may include more or less.)

PS: I’m looking for telecommuting work, editing, research (see this blog), patrick@durusau.net.

March 23, 2017

Boiling Reports of CIA Documents (Wikileaks CIA Vault 7 CIA Hacking Tools Revealed)

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 7:35 pm

Before you read today’s installment on the Wikileaks CIA Vault 7 CIA Hacking Tools Revealed, you should check out the latest drop from Wikileaks: CIA Vault 7 Dark Matter. Five documents and they all look interesting.

I started with a fresh copy of the HTML files in a completely new directory and ran Tidy first, plus fixed:

page_26345506.html:<declarations><string name="½ö"></string></declarations><p>›<br>

which I described in: Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1).

So with a clean and well-formed set of files, I modified the XQuery to collect all of the references to prior versions, reasoning that any file referenced as a prior version can be ditched, leaving only the latest files.


for $doc in collection('collection.xml')//a[matches(.,'^\d+$')]
return ($doc/@href/string(), ' ')

Unlike the count function we used before, this returns the value of the href attribute and appends a new line, after each one.

I saved that listing to priors.txt and then (your experience may vary on this next step):

xargs rm < priors.txt

WARNING: If your file names have spaces in them, you may delete files unintentionally. My data had no such spaces so this works in this case.
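
A slightly safer variant, assuming GNU xargs, treats each line as a single file name even if it contains spaces:

xargs -d '\n' rm -- < priors.txt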

Once I had the set of files without those representing “previous versions,” I’m down to the expected 1134.

That’s still a fair number of files and there is a lot of cruft in them.

For variety I did look at XSLT, but these are the empty XSLT template statements that would be needed to clean these files:

<xsl:template match="style"/>

<xsl:template match="//div[@id = 'submit_wlkey']" />

<xsl:template match="//div[@id = 'submit_help_contact']" />

<xsl:template match="//div[@id = 'submit_help_tor']" />

<xsl:template match="//div[@id = 'submit_help_tips']" />

<xsl:template match="//div[@id = 'submit_help_after']" />

<xsl:template match="//div[@id = 'submit']" />

<xsl:template match="//div[@id = 'submit_help_buttons']" />

<xsl:template match="//div[@id = 'efm-button']" />

<xsl:template match="//div[@id = 'top-navigation']" />

<xsl:template match="//div[@id = 'menu']" />

<xsl:template match="//footer" />

<xsl:template match="//script" />

<xsl:template match="//li[@class = 'comment']" />

Compare the XQuery query, on the command line no less:

for file in *.html; do
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -s:"$file" \
-qs:"/html/body//div[@id = 'uniquer']" -o:"$file.new"
done

(The backslash continues the command across the line break in front of -qs:, which is there strictly for formatting in this post.)

The files generated here will not be valid HTML.

Easy enough to fix with another round of Tidy.

After running Tidy, I was surprised to see a large number of very small files. Or at least I interpret 296 files of less than 1K in size to be very small files.

I created a list of them, linked back to the Wikileaks originals (296 Under 1K Wikileaks CIA Vault 7 Hacking Tools Revealed Files) so you can verify that I capture the content reported by Wikileaks. Oh, and here are the files I generated as well, Boiled Content of Unique CIA Vault 7 Hacking Tools Revealed Files.

In case you are interested, boiling the 1134 files took them from 38.6 MB to 8.8 MB of actual content for indexing, searching, concordances, etc.

Using the content only files, tomorrow I will illustrate how you can correlate information across files. Stay tuned!

March 22, 2017

The New Handbook For Cyberwar Is Being Written By Russia

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 8:35 pm

The New Handbook For Cyberwar Is Being Written By Russia by Sheera Frenkel.

From the post:


One US intelligence officer currently involved in cyber ops said, “It’s not that the Russians are doing something others can’t do. It’s not as though, say, the US wouldn’t have the technical skill level to carry out those types of attacks. It’s that Russian hackers are willing to go there, to experiment and carry out attacks that other countries would back away from,” said the officer, who asked not to be quoted by name due to the sensitivity of the subject. “It’s audacious, and reckless. They are testing things out in the field and refining them, and a lot of it is very, very messy and some is very smart.”

Well, “…testing things out in the field and refining them…” is the difference between a potential weapon on a dry erase board and a working weapon in practice. Yes?

Personally I favor the working weapon in practice.

It’s an interesting read despite its repetition of the now debunked claim that Wikileaks released 8,761 CIA documents (Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1)).

Frenkel of course covers the DNC hack:


The hack on the DNC, which US intelligence agencies have widely attributed to Russia, could be replicated by dozens of countries around the world, according to Robert Knake, a former director of cybersecurity policy in the Obama administration.

“Russia has laid out the playbook. What Russia did was relatively unsophisticated and something that probably about 60 countries around the world have the capability of doing — which is to target third parties, to steal documents and emails, and to selectively release them to create unfavorable conditions for that party,” Knake told the BBC’s Today. “It’s unsubtle interference. And it’s a violation of national sovereignty and customary law.”

Knake reflects the failure of major powers to understand the leveling potential of cyberwarfare. Sixty countries? You think? How about every kid that can run a phishing scam to steal John Podesta’s password? How many? 600,000 maybe? More than that?

None of whom care about “…national sovereignty and customary law.”

Are you going to write or be described in a chapter of the new book on cyberwar?

Your call.

Leak Publication: Sharing, Crediting, and Re-Using Leaks

Filed under: Data Provenance,Leaks,Open Access,Open Data — Patrick Durusau @ 4:53 pm

If you substitute “leak” for “data” in this essay by Daniella Lowenberg, does it work for leaks as well?

Data Publication: Sharing, Crediting, and Re-Using Research Data by Daniella Lowenberg.

From the post:

In the most basic terms- Data Publishing is the process of making research data publicly available for re-use. But even in this simple statement there are many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.

Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: 1 – data that are publicly accessible and are preserved for an indefinite amount of time, 2 – descriptive information about the data (metadata), and 3 – a citation for the data (giving credit to the data). Why are these elements essential? These three features make research data reusable and reproducible- the goal of a Data Publication.

As much as I admire the work of the International Consortium of Investigative Journalists (ICIJ), especially its Panama Papers project, sharing data beyond the confines of their community isn’t a value, much less a goal.

As all secret keepers, government, industry, organizations, ICIJ has “reasons” for its secrecy, but none that I find any more or less convincing than those offered by other secret keepers.

Every secret keeper has an agenda their secrecy serves. Agendas which don’t include a public empowered to make judgments about their secret keeping.

The ICIJ proclaims Leak to Us.

A good place to leak, but include with your leak a demand, an unconditional demand, that your leak be released in its entirety within a year or two of its first publication.

Help enable the public to watch all secrets and secret keepers, not just those some secret keepers choose to expose.

When To Worry About CIA’s Zero-Day Exploits

Filed under: CIA,Cybersecurity,Security — Patrick Durusau @ 3:58 pm

Chris McNab’s Alexsey’s TTPs (.. Tactics, Techniques, and Procedures) post on Alexsey Belan provides a measure for when to worry about Zero-Day exploits held by the CIA.

McNab lists:

  • Belan’s 9 offensive characteristics
  • 5 defensive controls
  • WordPress hack – 12 steps
  • LinkedIn targeting – 11 steps
  • Third victim – 11 steps

McNab observes:


Consider the number of organizations that provide services to their users and employees over the public Internet, including:

  • Web portals for sales and marketing purposes
  • Mail access via Microsoft Outlook on the Web and Google Mail
  • Collaboration via Slack, HipChat, SharePoint, and Confluence
  • DevOps and support via GitHub, JIRA, and CI/CD utilities

Next, consider how many enforce 2FA across their entire attack surface. Large enterprises often expose domain-joined systems to the Internet that can be leveraged to provide privileged network access (via Microsoft IIS, SharePoint, and other services supporting NTLM authentication).

Are you confident that safe 2FA is being enforced over your entire attack surface?

If not, don’t worry about potential CIA held Zero-Day exploits.

You’re in danger from script kiddies, not the CIA (necessarily).

Alexsey Belan made the Most Wanted list at the FBI.

Crimes listed:

Conspiring to Commit Computer Fraud and Abuse; Accessing a Computer Without Authorization for the Purpose of Commercial Advantage and Private Financial Gain; Damaging a Computer Through the Transmission of Code and Commands; Economic Espionage; Theft of Trade Secrets; Access Device Fraud; Aggravated Identity Theft; Wire Fraud

His FBI poster runs two pages but you could edit off the bottom of the first page to make it suitable for framing.

😉

Try hanging that up in your local university computer lab to test their support for free speech.

CIA Documents or Reports of CIA Documents? Vault7

Filed under: CIA,Leaks,Wikileaks — Patrick Durusau @ 3:17 pm

As I tool up to analyze the 1134 non-duplicate/artifact HTML files in Vault 7: CIA Hacking Tools Revealed, it occurred to me those aren’t “CIA documents.”

Take Weeping Angel (Extending) Engineering Notes as an example.

Caveat: My range of experience with “CIA documents” is limited to those obtained by Michael Best and others using Freedom of Information Act requests. But that should be sufficient to identify “CIA documents.”

Some things I notice about Weeping Angel (Extending) Engineering Notes:

  1. A Wikileaks header with donation button.
  2. “Vault 7: CIA Hacking Tools Revealed”
  3. Wikileaks navigation
  4. reported text
  5. More Wikileaks navigation
  6. Ads for Wikileaks, Tor, Tails, Courage, bitcoin

I’m going to say that the 1134 non-duplicate/artifact HTML files in Vault7, Part1, are reports of portions (which portions is unknown) of some unknown number of CIA documents.

A distinction that influences searching, indexing, concordances, word frequency, just to name a few.

What I need is the reported text, minus:

  1. A Wikileaks header with donation button.
  2. “Vault 7: CIA Hacking Tools Revealed”
  3. Wikileaks navigation
  4. More Wikileaks navigation
  5. Ads for Wikileaks, Tor, Tails, Courage, bitcoin

Check in tomorrow when I boil 1134 reports of CIA documents to get something better suited for text analysis.

XQuery 3.1 and Company! (Deriving New Versions?)

Filed under: XML,XPath,XQuery — Patrick Durusau @ 9:09 am

XQuery 3.1: An XML Query Language W3C Recommendation 21 March 2017

Hurray!

Related reading of interest:

XML Path Language (XPath) 3.1

XPath and XQuery Functions and Operators 3.1

XQuery and XPath Data Model 3.1

These recommendations are subject to licenses that read in part:

No right to create modifications or derivatives of W3C documents is granted pursuant to this license, except as follows: To facilitate implementation of the technical specifications set forth in this document, anyone may prepare and distribute derivative works and portions of this document in software, in supporting materials accompanying software, and in documentation of software, PROVIDED that all such works include the notice below. HOWEVER, the publication of derivative works of this document for use as a technical specification is expressly prohibited.

You know I think the organization of XQuery 3.1 and friends could be improved but deriving and distributing “improved” versions is expressly prohibited.

Hmmm, but we are talking about XML and languages to query and transform XML.

Consider the potential of a query that calls XQuery 3.1: An XML Query Language and the materials cited in it, then returns a version of XQuery 3.1 that has definitions from other standards off-set in the XQuery 3.1 text.

Or that inserts examples or other materials into the text.

For decades XML enthusiasts have bruited about dynamic texts but have produced damned few of them (as in zero) as their standards.

Let’s use the “no derivatives” language of the W3C as an incentive to not create another static document but a dynamic one that can grow or contract according to the wishes of its reader.

Suggestions for first round features?

March 21, 2017

Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2 – The PDF Files)

Filed under: Leaks,Vault 7,Wikileaks — Patrick Durusau @ 4:21 pm

You may want to read Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1) before reading this post. In Part 1, I walk you through obtaining a copy of Wikileaks’ Vault 7: CIA Hacking Tools Revealed so you can follow and check my analysis and conclusions.

Fact checking applies to every source, including this blog.

I proofed my listing of the 357 PDF files in the first Vault 7 release and report an increase in arguably CIA files and a slight decline in public documents. An increase from 114 to 125 for the CIA and a decrease from 109 to 98 for public documents.

  1. Arguably CIA – 125
  2. Public – 98
  3. Wikileaks placeholders – 134

The listings to date:

  1. CIA (maybe)
  2. Public documents
  3. Wikileaks placeholders

For public documents, I created hyperlinks whenever possible. (Saying a fact and having evidence are two different things.) Vendor documentation that was not marked with a security classification I counted as public.

All I can say for the Wikileaks placeholders, some 134 of them, is to ignore them unless you like mining low grade ore.

I created notes in the CIA listing to help narrow your focus down to highly relevant materials.

I have more commentary in the works but I wanted to release these listings in case they help others make efficient use of their time.

Enjoy!

PS: A question I want to start addressing this week is how the dilution of a leak impacts the use of same?

March 20, 2017

Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 1)

Filed under: CIA,Leaks,Vault 7,Wikileaks — Patrick Durusau @ 10:29 am

Executive Summary:

If you reported Vault 7: CIA Hacking Tools Revealed as containing:

8,761 documents and files from an isolated, high-security network situated inside the CIA’s Center for Cyber Intelligence in Langley, Virgina…. (Vault 7: CIA Hacking Tools Revealed)

you failed to check your facts.

I detail my process below but in terms of numbers:

  1. Of 7809 HTML files, 6675 are duplicates or Wikileaks artifacts
  2. Of 357 PDF files, 134 are Wikileaks artifacts (for materials not released). Of the remaining 223 PDF files, 109 are public information, the GNU Make Manual for instance. Out of the 357 PDF files, Wikileaks has delivered 114 arguably from the CIA, and some of those are doubtful. (Part 2, forthcoming)

Wikileaks haters will find little solace here. My criticisms of Wikileaks are for padding the leak and not enabling effective use of the leak. Padding the leak is obvious from the inclusion of numerous duplicate and irrelevant documents. Effective use of the leak is impaired by the padding but also by teases of what could have been released but wasn’t.

Getting Started

To start on common ground, fire up a torrent client, obtain and decompress: Wikileaks-Year-Zero-2017-v1.7z.torrent.

Decompression requires this password: SplinterItIntoAThousandPiecesAndScatterItIntoTheWinds

The root directory is year0.

When I run a recursive ls from above that directory:

ls -R year0 | wc -l

My system reports: 8820

Change to the year0 directory and ls reveals:

bootstrap/ css/ highlighter/ IMG/ localhost:6081@ static/ vault7/

Checking the files in vault7:

ls -R vault7/ | wc -l

returns: 8755

Change to the vault7 directory and ls shows:

cms/ files/ index.html logo.png

The directory files/ has only one file, org-chart.png. An organization chart of the CIA, but the sub-departments are listed with acronyms and “???.” Did the author of the chart not know the names of those departments? I point that out as the first of many file anomalies.

Some 7809 HTML files are found under cms/.

The cms/ directory has a sub-directory files, plus main.css and 7809 HTML files (including the index.html file).

Duplicated HTML Files

I discovered duplication of the HTML files quite by accident. I had prepared the files with Tidy for parsing with Saxon and compressed a set of those files for uploading.

The 7808 files I compressed started at 296.7 MB.

The compressed size, using 7z, was approximately 3.5 MB.

That’s almost two orders of magnitude of compression. 7z is good, but it’s not quite that good. 😉

Checking my file compression numbers

You don’t have to take my word for the file compression experience. If you select all the page_*, space_* and user_* HTML files in a file browser, it should report a total size of 296.7 MB.

Create a sub-directory to year0/vault7/cms/, say mkdir htmlfiles and then:

cp *.html htmlfiles

Then: cd htmlfiles

and,

7z a compressedhtml.7z *.html

Run: ls -l compressedhtml.7z

Result: 3488727 Mar 16 16:31 compressedhtml.7z

Tom Harris, in How File Compression Works, explains that:


Most types of computer files are fairly redundant — they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.

If you don’t agree the HTML file are highly repetitive, check the next section where one source of duplication is demonstrated.

Demonstrating Duplication of HTML files

Let’s start with the same file as we look for a source of duplication. Load Cinnamon Cisco881 Testing at Wikileaks into your browser.

Scroll to near the bottom of the file where you see:

Yes! There are 136 prior versions of this alleged CIA file in the directory.

Cinnamon Cisco881 Testing has the most prior versions, but all of them have prior versions.

Are we now in agreement that duplicated versions of the HTML pages exist in the year0/vault7/cms/ directory?

Good!

Now we need to count how many duplicated files there are in year0/vault7/cms/.

Counting Prior Versions of the HTML Files

You may or may not have noticed but every reference to a prior version takes the form:

<a href="filename.html">integer</a>

That’s going to be an important fact, but let’s clean up the HTML so we can process it with XQuery/Saxon.

Preparing for XQuery

Before we start crunching the HTML files, let’s clean them up with Tidy.

Here’s my Tidy config file:

output-xml: yes
quote-nbsp: no
show-warnings: no
show-info: no
quiet: yes
write-back: yes

In htmlfiles I run:

tidy -config tidy.config *.html

Tidy reports two errors:


line 887 column 1 - Error: <declarations> is not recognized!
line 887 column 15 - Error: <string> is not recognized!

Grepping for “declarations>”:

grep "declarations" *.html

Returns:

page_26345506.html:<declarations><string name="½ö"></string></declarations><p>›<br>

The string element is present as well so we open up the file and repair it with XML comments:

<!-- <declarations><string name="½ö"></string></declarations><p>›<br> -->
<!-- prior line commented out to avoid Tidy error, pld 14 March 2017-->

Rerun Tidy:

tidy -config tidy.config *.html

Now Tidy returns no errors.

XQuery Finds Prior Versions

Our files are ready to be queried but 7809 is a lot of files.

There are a number of solutions but a simple one is to create an XML collection of the documents and run our XQuery statements across the files as a set.

Here’s how I created a collection file for these files:

I did an ls in the directory and piped that to collection.xml. Opening the file I deleted index.html, started each entry with <doc href=" and ended each one with "/>, inserted <collection> before the first entry and </collection> after the last entry and then saved the file.
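
Those manual steps can also be scripted; a sketch that builds the same collection file, assuming the page_, space_ and user_ prefixes cover every file except index.html:

ls page_*.html space_*.html user_*.html \
| awk 'BEGIN { print "<collection>" }
{ print "  <doc href=\"" $0 "\"/>" }
END { print "</collection>" }' > collection.xml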

Your version should look something like:

<collection>
  <doc href="page_10158081.html"/>
  <doc href="page_10158088.html"/>
  <doc href="page_10452995.html"/>
...
  <doc href="user_7995631.html"/>
  <doc href="user_8650754.html"/>
  <doc href="user_9535837.html"/>
</collection>

The prior versions in Cinnamon Cisco881 Testing from Wikileaks, have this appearance in HTML source:

<h3>Previous versions:</h3>
<p>| <a href="page_17760540.html">1</a> <span class="pg-tag"><i>empty</i></span>
| <a href="page_17760578.html">2</a> <span class="pg-tag"></span>

…..

| <a href="page_23134323.html">135</a> <span class="pg-tag">[Xetron]</span>
| <a href="page_23134377.html">136</a> <span class="pg-tag">[Xetron]</span>
|</p>
</div>

You will need to spend some time with the files (I have obviously) to satisfy yourself that <a> elements that contain only numbers are exclusively used for prior references. If you come across any counter-examples, I would be very interested to hear about them.

To get a file count on all the prior references, I used:

let $count := count(collection('collection.xml')//a[matches(.,'^\d+$')])
return $count
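
To run it, save the query to a file (the name priors-count.xq is mine) and invoke Saxon from the directory holding collection.xml, the same Saxon invocation used for the boiling step elsewhere in this series:

java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -q:priors-count.xq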

Run that script to find: 6514 previous editions of the base files

Unpacking the XQuery

Rest assured that’s not how I wrote the first XQuery on this data set! 😉

Without exploring all the by-ways and alleys I traversed, I will unpack that query.

First, the goal of the query is to identify every <a> element that only contains digits. Recall that previous version links have only digits in their <a> elements.

A shout out to Jonathan Robie, Editor of XQuery, for reminding me that string expressions match substrings unless they have beginning and ending line anchors. Here:

'^\d+$'

The \d matches only digits, the + enables matching 1 or more digits, and the beginning ^ and ending $ eliminate any <a> elements that might start with one or more digits but also contain text, like links to files, etc.

Expanding out a bit more, in [matches(.,'^\d+$')] the [ ] enclose a predicate that consists of the matches function, which takes two arguments. The . here represents the content of an <a> element, followed by a comma as a separator and then the regex that provides the pattern to match against.

Although often described as a “code smell,” the //a in //a[matches(.,'^\d+$')] picks up the <a> elements wherever they are located. We already had to repair these HTML files and I don't want to spend more time debugging ephemeral HTML.

Almost there! The collection function, collection('collection.xml'), applied to our collection file, enables us to run the XQuery across all the files listed in it.

Finally, we surround all of the foregoing with the count function: count(collection('collection.xml')//a[matches(.,'^\d+$')]) and declare a variable to capture the result of the count function: let $count :=

So far so good? I know, tedious for XQuery jocks but not all news reporters are XQuery jocks, at least not yet!

Then we produce the results: return $count.
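
If you want more than a bare count, a small variation on the same query (a sketch against the same collection file) returns the target of every prior-version link, which is handy for spot-checking:

(: Sketch: list the href of every digits-only <a> element :)
for $a in collection('collection.xml')//a[matches(., '^\d+$')]
return string($a/@href)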

But 6514 files aren't 6675 files; you said 6675 files!

Yes, you're right! Thanks for paying attention!

I said at the top, 6675 are duplicates or Wikileaks artifacts.

Where are the others?

If you look at User #71477, which has the file name user_40828931.html, you will find it's not a CIA document but part of the Wikileaks administration for these documents. There are 90 such pages.

If you look at Marble Framework, which has the file name space_15204359.html, you find it's not a CIA document but a form of indexing created by Wikileaks. There are 70 such pages.

Don’t forget the index.html page.

When added together, 6514 (prior versions) + 90 (user pages) + 70 (space pages) + 1 (index.html), I get 6675 duplicates or Wikileaks artifacts.
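
If you want to check the user and space page counts yourself, a sketch along these lines runs against the same collection file (it assumes the user_*.html and space_*.html naming shown above):

(: Sketch: count the Wikileaks user_ and space_ pages in the collection :)
let $docs := collection('collection.xml')
return (
  count($docs[matches(document-uri(.), 'user_\d+\.html$')]),
  count($docs[matches(document-uri(.), 'space_\d+\.html$')])
)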

What’s your total?


Tomorrow:

In Fact Checking Wikileaks’ Vault 7: CIA Hacking Tools Revealed (Part 2), I look under year0/vault7/cms/files to discover:

  1. Arguably CIA files (maybe) – 114
  2. Public documents – 109
  3. Wikileaks artifacts – 134

I say “Arguably CIA” because there are file artifacts and anomalies that warrant your attention in evaluating those files.

March 19, 2017

UK Proposes to Treat Journalists As Spies (Your Response Here)

Filed under: Censorship,Free Speech,Journalism,News,Reporting — Patrick Durusau @ 4:33 pm

UK’s proposed Espionage Act will treat journalists like spies by Roy Greenslade.

From the post:

Journalists in Britain are becoming increasingly alarmed by the government’s apparent determination to prevent them from fulfilling their mission to hold power to account. The latest manifestation of this assault on civil liberties is the so-called Espionage Act. If passed by parliament, it could lead to journalists who obtain leaked information, along with the whistle blowers who provide it to them, serving lengthy prison sentences.

In effect, it would equate journalists with spies, and its threat to press freedom could not be more stark. It would not so much chill investigative journalism as freeze it altogether.

The proposal is contained in a consultation paper, “Protection of Official Data,” which was drawn up by the Law Commission. Headed by a senior judge, the commission is ostensibly independent of government. Its function is to review laws and recommend reforms to ensure they are fairer and more modern.

But fairness is hardly evident in the proposed law. Its implications for the press were first highlighted in independent news website The Register by veteran journalist Duncan Campbell, who specializes in investigating the U.K. security services.

Comments on the public consultation document can be registered here.

Greenslade reports criticism of the proposal earned this response from the government:


In response, both Theresa May's government and the Law Commission stressed that it was an early draft of the proposed law change. Then the commission followed up by extending the public consultation period by a further month, setting a deadline of May 3.

Early draft, last draft or the final form from parliament, journalists should treat the proposed Espionage Act as a declaration of war on the press.

Being classified as spies, journalists should start acting as spies. Spies that offer no quarter and who take no prisoners.

Develop allies in other countries who are willing to publish information detrimental to your government.

The government has chosen a side and it’s not yours. What more need be said?

March 18, 2017

Congress API Update

Filed under: Government,Politics — Patrick Durusau @ 9:09 pm

Congress API Update by Derek Willis.

From the post:

When we took over projects from the Sunlight Foundation last year, we inherited an Application Programming Interface, or API, that overlapped with one of our own.

Sunlight’s Congress API and ProPublica’s Congress API are similar enough that we decided to try to merge them together rather than run them separately, and to do so in a way that makes as few users change their code as possible.

Today we’ve got an update on our progress.

Users of the ProPublica Congress API can now access additional fields in responses for Members, Bills, Votes and Nominations. We’ve updated our documentation to provide examples of those responses. These aren’t new responses but existing ones that now include some new attributes brought over from the Sunlight API. Details on those fields are here.

We plan to fold in Sunlight fields and responses for Committees, Hearings, Floor Updates and Amendments, though that work isn’t finished yet.

The daily waves of bad information on congressional legislation will not be stopped by good information.

However, good information can be used to pick meaningful fights, rather than debating 140-character-or-less brain farts.

Your choice.

RegexBuddy (Think Occur Mode for Emacs)

Filed under: Regexes,Searching — Patrick Durusau @ 4:44 pm

RegexBuddy

From the webpage:

RegexBuddy is your perfect companion for working with regular expressions. Easily create regular expressions that match exactly what you want. Clearly understand complex regexes written by others. Quickly test any regex on sample strings and files, preventing mistakes on actual data. Debug without guesswork by stepping through the actual matching process. Use the regex with source code snippets automatically adjusted to the particulars of your programming language. Collect and document libraries of regular expressions for future reuse. GREP (search-and-replace) through files and folders. Integrate RegexBuddy with your favorite searching and editing tools for instant access.

Learn all there is to know about regular expressions from RegexBuddy’s comprehensive documentation and regular expression tutorial.

I was reminded of RegexBuddy when I stumbled on the RegexBuddy Manual in a search result.

The XQuery/XPath regex treatment is far briefer than I would like but at 500+ pages, it’s an impressive bit of work. Even without a copy of RegexBuddy, working through the examples will make you a regex terrorist.
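
If you want a taste of the XQuery/XPath regex functions the manual covers, here are a few one-liners with fn:matches, fn:replace and fn:tokenize (results noted in the comments):

(: Small illustrations of the XPath/XQuery regex functions :)
matches("page_17760540.html", "^page_\d+\.html$"), (: true :)
replace("pseudo   html", "\s+", " "),              (: "pseudo html" :)
tokenize("a,b,,c", ",")                            (: ("a", "b", "", "c") :)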

The only unfortunate aspect, for *nix users, is that you need to run RegexBuddy in a Windows VM. 🙁

If you are comfortable with Emacs, Windows or otherwise, then the Occur mode comes to mind. It doesn’t have the visuals of RegexBuddy but then you are accustomed to a power-user environment.

In terms of productivity, it’s hard to beat regexes. I passed along a one liner awk regex tip today to extract content from a “…pile of nonstandard multiply redundant JavaScript infested pseudo html.”

I’ve seen the HTML in question. The description seems a bit generous to me. 😉

Try your hand at regexes and see if your productivity increases!

March 16, 2017

Balisage Papers Due in 3 Weeks!

Filed under: Conferences,XML,XQuery,XSLT — Patrick Durusau @ 9:04 pm

Apologies for the sudden lack of posting but I have been working on a rather large data set with XQuery and checking forwards and backwards to make sure it can be replicated. (I hate “it works on my computer.”)

Anyway, Tommie Usdin dropped an email bomb today with a reminder that Balisage papers are due on April 7, 2017.

From her email:

Submissions to “Balisage: The Markup Conference” and pre-conference symposium:
“Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions”
are due on April 7.

It is time to start writing!

Balisage: The Markup Conference 2017
August 1 — 4, 2017, Rockville, MD (a suburb of Washington, DC)
July 31, 2017 — Symposium Up-Translation and Up-Transformation
https://www.balisage.net/

Balisage: where serious markup practitioners and theoreticians meet every August. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

• Web application development with XML
• Informal data models and consensus-based vocabularies
• Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
• Performance issues in parsing, XML database retrieval, or XSLT processing
• Development of angle-bracket-free user interfaces for non-technical users
• Semistructured data and full text search
• Deployment of XML systems for enterprise data
• Design and implementation of XML vocabularies
• Case studies of the use of XML for publishing, interchange, or archiving
• Alternatives to XML
• the role(s) of XML in the application lifecycle
• the role(s) of vocabularies in XML environments

Detailed Call for Participation: http://balisage.net/Call4Participation.html
About Balisage: http://balisage.net/Call4Participation.html

pre-conference symposium:
Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions
Chair: Evan Owens, Cenveo
https://www.balisage.net/UpTransform/index.html

Increasing the granularity and/or specificity of markup is an important task in many content and information workflows. Markup transformations might involve tasks such as high-level structuring, detailed component structuring, or enhancing information by matching or linking to external vocabularies or data. Enhancing markup presents secondary challenges including lack of structure of the inputs or inconsistency of input data down to the level of spelling, punctuation, and vocabulary. Source data for up-translation may be XML, word processing documents, plain text, scanned & OCRed text, or databases; transformation goals may be content suitable for page makeup, search, or repurposing, in XML, JSON, or any other markup language.

The range of approaches to up-transformation is as varied as the variety of specifics of the input and required outputs. Solutions may combine automated processing with human review or could be 100% software implementations. With the potential for requirements to evolve over time, tools may have to be actively maintained and enhanced. This is the place to discuss goals, challenges, solutions, and workflows for significant XML enhancements, including approaches, tools, and techniques that may potentially be used for a variety of other tasks.

For more information: info@balisage.net or +1 301 315 9631

I’m planning on posting tomorrow one way or the other!

While you wait for that, get to work on your Balisage paper!

March 14, 2017

Pre-Installed Malware – Espionage Potential

Filed under: Cybersecurity,Journalism,News,Reporting — Patrick Durusau @ 11:03 am

Malware found pre-installed on dozens of different Android devices by David Bisson.

From the post:

Malware in the form of info-stealers, rogue ad networks, and even ransomware came pre-installed on more than three dozen different models of Android devices.

Researchers with Check Point spotted the malware on 38 Android devices owned by a telecommunications company and a multinational technology company.

See David’s post for the details, but it raises the intriguing opportunity to supply government and corporate offices with equipment that has malware pre-installed.

No more general or targeted phishing schemes, difficult attempts to breach physical security and/or to avoid anti-virus or security programs.

The “you leak, we print” model of the news media makes it unlikely news organizations will want to get their skirts dirty pre-installing malware on hardware.

News organizations consider themselves “ethical” in publishing stolen information but are unwilling to steal it themselves, because stealing is “unethical.”

There’s some nuance in there I am missing, perhaps that being proven to have stolen carries a prison sentence in most places. Odd how ethics correspond to self-interest isn’t it?

If you are interested in the number of opportunities for malware on computers in 2017, check out Computers Sold This Year. It reports as of today over 41 million computers sold this year alone.

News organizations don’t have the skills to create a malware network but if information were treated as having value, separate from the means of its acquisition, a viable market would not be far behind.

New Wiper Malware – A Path To Involuntary Transparency

Filed under: Cybersecurity,Security — Patrick Durusau @ 9:19 am

From Shamoon to StoneDrill – Advanced New Destructive Malware Discovered in the Wild by Kaspersky Lab

From the press release:

The Kaspersky Lab Global Research and Analysis Team has discovered a new sophisticated wiper malware, called StoneDrill. Just like another infamous wiper, Shamoon, it destroys everything on the infected computer. StoneDrill also features advanced anti-detection techniques and espionage tools in its arsenal. In addition to targets in the Middle East, one StoneDrill target has also been discovered in Europe, where wipers used in the Middle East have not previously been spotted in the wild.

Besides the wiping module, Kaspersky Lab researchers have also found a StoneDrill backdoor, which has apparently been developed by the same code writers and used for espionage purposes. Experts discovered four command and control panels which were used by attackers to run espionage operations with help of the StoneDrill backdoor against an unknown number of targets.

Perhaps the most interesting thing about StoneDrill is that it appears to have connections to several other wipers and espionage operations observed previously. When Kaspersky Lab researchers discovered StoneDrill with the help of Yara-rules created to identify unknown samples of Shamoon, they realised they were looking at a unique piece of malicious code that seems to have been created separately from Shamoon. Even though the two families – Shamoon and StoneDrill – don’t share the exact same code base, the mind-set of the authors and their programming “style” appear to be similar. That’s why it was possible to identify StoneDrill with the Shamoon-developed Yara-rules.

Code similarities with older known malware were also observed, but this time not between Shamoon and StoneDrill. In fact StoneDrill uses some parts of the code previously spotted in the NewsBeef APT, also known as Charming Kitten – another malicious campaign which has been active in the last few years.

For details beyond the press release, see: From Shamoon to StoneDrill: Wipers attacking Saudi organizations and beyond by Costin Raiu, Mohamad Amin Hasbini, Sergey Belov, Sergey Mineev or the full report, same title, version 1.05.

Wipers can impact corporate and governmental operations but they may be hiding crimes and misdeeds at the same time.

Of greater interest are the espionage operations enabled by StoneDrill.

If you are interested in planting false flags, pay particular attention to the use of language analysis in the full report.

Taking a clue from Lakoff on framing, would your opinion of StoneDrill change if instead of “espionage” it was described as a “corporate/government transparency” tool?

I don’t recall anyone saying that transparency is by definition voluntary.

Perhaps that’s the ticket. Malware can bring about involuntary transparency.

Yes?
