Archive for October, 2016

Guide to Making Search Relevance Investments, free ebook

Thursday, October 20th, 2016

Guide to Making Search Relevance Investments, free ebook

Doug Turnbull writes:

How well does search support your business? Are your investments in smarter, more relevant search, paying off? These are business-level questions, not technical ones!

After writing Relevant Search we find ourselves helping clients evaluate their search and discovery investments. Many invest far too little, or struggle to find the areas to make search smarter, unsure of the ROI. Others invest tremendously in supposedly smarter solutions, but have a hard time justifying the expense or understanding the impact of change.

That’s why we’re happy to announce OpenSource Connection’s official search relevance methodology!

The free ebook? Guide to Relevance Investments.

I know, I know, the title is a interest killer.

Think Search ROI. Not something you hear about often but it sounds attractive.

Runs 16 pages and is a blessed relief from the “data has value (unspecified)” mantras.

Search and investment in search is a business decision and this guide nudges you in that direction.

What you do next is up to you.


Every Congressional Research Service Report – 8,000+ and growing!

Wednesday, October 19th, 2016

From the homepage:

We’re publishing reports by Congress’s think tank, the Congressional Research Service, which provides valuable insight and non-partisan analysis of issues of public debate. These reports are already available to the well-connected — we’re making them available to everyone for free.

From the about page:

Congressional Research Service reports are the best way for anyone to quickly get up to speed on major political issues without having to worry about spin — from the same source Congress uses.

CRS is Congress’ think tank, and its reports are relied upon by academics, businesses, judges, policy advocates, students, librarians, journalists, and policymakers for accurate and timely analysis of important policy issues. The reports are not classified and do not contain individualized advice to any specific member of Congress. (More: What is a CRS report?)

Until today, CRS reports were generally available only to the well-connected.

Now, in partnership with a Republican and Democratic member of Congress, we are making these reports available to everyone for free online.

A coalition of public interest groups, journalists, academics, students, some Members of Congress, and former CRS employees have been advocating for greater access to CRS reports for over twenty years. Two bills in Congress to make these reports widely available already have 10 sponsors (S. 2639 and H.R. 4702, 114th Congress) and we urge Congress to finish the job.

This website shows Congress one vision of how it could be done.

What does include? includes 8,255 CRS reports. The number changes regularly.

It’s every CRS report that’s available on Congress’s internal website.

We redact the phone number, email address, and names of virtually all the analysts from the reports. We add disclaimer language regarding copyright and the role CRS reports are intended to play. That’s it.

If you’re looking for older reports, our good friends at may have them.

We also show how much a report has changed over time (whenever CRS publishes an update), provide RSS feeds, and we hope to add more features in the future. Help us make that possible.

To receive an email alert for all new reports and new reports in a particular topic area, use the RSS icon next to the topic area titles and a third-party service, like IFTTT, to monitor the RSS feed for new additions.

This is major joyful news for policy wonks and researchers everywhere.

A must bookmark and contribute to support site!

My joy was alloyed by the notice:

We redact the phone number, email address, and names of virtually all the analysts from the reports. We add disclaimer language regarding copyright and the role CRS reports are intended to play. That’s it.

The privileged, who get the CRS reports anyway, have that information?

What is the value in withholding it from the public?

Support the project but let’s put the public on an even footing with the privileged shall we?

The Podesta Emails [In Bulk]

Wednesday, October 19th, 2016

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:


and, advanced:


As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!


#Truth2016 – The year when truth “interfered” with a democratic election.

Wednesday, October 19th, 2016

Unless you have been in solitary confinement or a medically induced coma for the last several weeks, you are aware that Wikileaks has been accused of “interfering” with the 2016 US presidential election.

The crux of that complaint is the release by Wikileaks of a series of emails collectively known as the Podesta Emails, which are centered on the antics of Hillary Clinton and her crew as she runs for the presidency.

The untrustworthy who made these accusations include the Department of Homeland Security and the Office of the Director of National Intelligence on Election Security. In a no facts revealed statement: Joint Statement from the Department Of Homeland Security and Office of the Director of National Intelligence on Election Security, the claim of interference is made but not substantiated.

The cry of “interference” has been taken up by an uncritical media and echoed by President Barack Obama.

There’s just one problem.

We know who was sent the emails in question and despite fanciful casting of doubt on their accuracy, out of hundred of participants, not one, nary one, has stepped forward with an original email to prove these are false.

Simple enough to ask some third-party expert to retrieve the emails in question from a server and then to compare to the Wikileaks releases.

But I have heard of no moves in that direction.

Have you?

The crux of the current line by the US government is that truthful documents may influence the coming presidential election. In a direction they don’t like.

Think about that for a moment: Truthful documents (in the sense of accuracy) interfering with a democratic election.

That makes me wonder what definition of “democratic” that Clinton, Obama and the media must share?

Not anything I would recognize as a democracy. You?

S20-211a Hebrew Bible Technology Buffet – November 20, 2016 (save that date!)

Tuesday, October 18th, 2016

S20-211a Hebrew Bible Technology Buffet

From the webpage:

On Sunday, November 20th 2016, from 1:00 PM to 3:30 PM, GERT will host a session with the theme “Hebrew Bible Technology Buffet” at the SBL Annual Meeting in room 305 of the Convention Center. Barry Bandstra of Hope College will preside.

The session has four presentations:

Presentations will be followed by a discussion session.

You will need to register for the Annual Meeting to attend the session.

Assuming they are checking “badges” to make sure attendees have registered. Registration is very important to those who “foster” biblical scholarship by comping travel and rooms for their close friends.

PS: The website reports non-member registration is $490.00. I would like to think that is a mis-print but I suspect its not.

That’s one way to isolate yourself from an interested public. By way of contrast, snail-mail Biblical Greek courses in the 1890’s had tens of thousands of subscribers. When academics complain of being marginalized, use this as an example of self-marginalization.

Threatening the President: A Signal/Noise Problem

Tuesday, October 18th, 2016

Even if you can’t remember why the pointy end of a pencil is important, you too can create national news.

This bit of noise reminded me of an incident when I was in high school where some similar type person bragged in a local bar about assassinating then President Nixon*. Was arrested and sentenced to several years in prison.

At the time I puzzled briefly over the waste of time and effort in such a prosecution and then promptly forgot it.

Until this incident with the overly “clever” Trump supporter.

To get us off on the same foot:

18 U.S. Code § 871 – Threats against President and successors to the Presidency

(a) Whoever knowingly and willfully deposits for conveyance in the mail or for a delivery from any post office or by any letter carrier any letter, paper, writing, print, missive, or document containing any threat to take the life of, to kidnap, or to inflict bodily harm upon the President of the United States, the President-elect, the Vice President or other officer next in the order of succession to the office of President of the United States, or the Vice President-elect, or knowingly and willfully otherwise makes any such threat against the President, President-elect, Vice President or other officer next in the order of succession to the office of President, or Vice President-elect, shall be fined under this title or imprisoned not more than five years, or both.

(b) The terms “President-elect” and “Vice President-elect” as used in this section shall mean such persons as are the apparent successful candidates for the offices of President and Vice President, respectively, as ascertained from the results of the general elections held to determine the electors of President and Vice President in accordance with title 3, United States Code, sections 1 and 2. The phrase “other officer next in the order of succession to the office of President” as used in this section shall mean the person next in the order of succession to act as President in accordance with title 3, United States Code, sections 19 and 20.

Commonplace threatening letters, calls, etc., aren’t documented for the public but President Barack Obama has a Wikipedia page devoted to the more significant ones: Assassination threats against Barack Obama.

Just as no one knows you are a dog on the internet, no one can tell by looking at a threat online if you are still learning how to use a pencil or are a more serious opponent.

Leaving to one side that a truly serious opponent allows actions to announce their presence or goal.

The treatment of even idle bar threats as serious is an attempt to improve the signal-to-noise ratio:

In analog and digital communications, signal-to-noise ratio, often written S/N or SNR, is a measure of signal strength relative to background noise. The ratio is usually measured in decibels (dB) using a signal-to-noise ratio formula. If the incoming signal strength in microvolts is Vs, and the noise level, also in microvolts, is Vn, then the signal-to-noise ratio, S/N, in decibels is given by the formula: S/N = 20 log10(Vs/Vn)

If Vs = Vn, then S/N = 0. In this situation, the signal borders on unreadable, because the noise level severely competes with it. In digital communications, this will probably cause a reduction in data speed because of frequent errors that require the source (transmitting) computer or terminal to resend some packets of data.

I’m guessing the reasoning is the more threats that go unspoken, the less chaff the Secret Service has to winnow in order to uncover viable threats.

One assumes they discard physical mail with return addresses of prisons, mental hospitals, etc., or at most request notice of the release of such people from state custody.

Beyond that, they don’t appear to be too picky about credible threats, noting that in one case an unspecified “death ray” was going to be used against President Obama.

The EuroNews description of that case must be shared:

Two American men have been arrested and charged with building a remote-controlled X-ray machine intended for killing Muslims and other perceived enemies of the U.S.

Following a 15-month investigation launched in April 2012, Glenn Scott Crawford and Eric J. Feight are accused of developing the device, which the FBI has described as “mobile, remotely operated, radiation emitting and capable of killing human targets silently and from a distance with lethal doses of radiation”.

Sure, right. I will post a copy of the 67-page complaint, which uses terminology rather loosely, to say the least, in a day or so. Suffice it to say that the defendants never acquired a source for the needed radioactivity production.

On the order of having a complete nuclear bomb but not nuclear material to make it into a nuclear bomb. You would be in more danger from the conventional explosive degrading than the bomb as a nuclear weapon.

Those charged with defending public officials want to deter the making of threats, so as to improve the signal/noise ratio.

The goal of those attacking public officials is a signal/noise ratio of exactly 0.0.

Viewing threats from an information science perspective suggests various strategies for either side. (Another dividend of studying information science.)

*They did find a good picture of Nixon for the White House page. Doesn’t look as much like a weasel as he did in real life. Gimp/Photoshop you think?

How To Read: “War Goes Viral” (with caution, propaganda ahead)

Monday, October 17th, 2016


War Goes Viral – How social media is being weaponized across the world by Emerson T. Brooking and P. W. Singer.

One of the highlights of the post reads:

Perhaps the greatest danger in this dynamic is that, although information that goes viral holds unquestionable power, it bears no special claim to truth or accuracy. Homophily all but ensures that. A multi-university study of five years of Facebook activity, titled “The Spreading of Misinformation Online,” was recently published in Proceedings of the National Academy of Sciences. Its authors found that the likelihood of someone believing and sharing a story was determined by its coherence with their prior beliefs and the number of their friends who had already shared it—not any inherent quality of the story itself. Stories didn’t start new conversations so much as echo preexisting beliefs.

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

Ooooh, “…’truth’ becomes a matter of emotional resonance.”

That is always true but give the authors their due, “War Goes Viral” is a masterful piece of propaganda to the contrary.

Calling something “propaganda,” or “media bias” is easy and commonplace.

Let’s do the hard part and illustrate why that is the case with “War Goes Viral.”

The tag line:

How social media is being weaponized across the world

preps us to think:

Someone or some group is weaponizing social media.

So before even starting the article proper, we are prepared to be on the look out for the “bad guys.”

The authors are happy to oblige with #AllEyesOnISIS, first paragraph, second sentence. “The self-styled Islamic State…” appears in the second paragraph and ISIS in the third paragraph. Not much doubt who the “bad guys” are at this point in the article.

Listing only each change of current actors, “bad guys” in red, the article from start to finish names:

  • Islamic State
  • Russia
  • Venezuela
  • China
  • U.S. Army training to combat “bad guys”
  • Israel – neutral
  • Islamic State (Hussain)

The authors leave you with little doubt who they see as the “bad guys,” a one-sided view of propaganda and social media in particular.

For example, there is:

No mention of Voice of American (VOA), perhaps one of the longest running, continuous disinformation campaigns in history.

No mention of Pentagon admits funding online propaganda war against Isis.

No mention of any number of similar projects and programs which weren’t constructed with an eye on “truth and accuracy” by the United States.

The treatment here is as one-sided as the “weaponized” social media of which the authors complain.

Not that the authors are lacking in skill. They piggyback their own slant onto The Spreading of Misinformation Online:

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

How much of that is supported by The Spreading of Misinformation Online?

  • First sentence
  • Second sentence
  • Both sentences

The answer is:

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.”

The remainder of that paragraph was invented out of whole clothe by the authors and positioned with “truth” in quotes to piggyback on the legitimate academic work just quoted.

As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

Is popular cant among media and academic types but no more than that.

Skilled reporting can put information in a broad context and weave a coherent narrative, but disparaging social media authors doesn’t make that any more likely.

“War Goes Viral” being a case in point.

XML Prague 2017 is coming

Sunday, October 16th, 2016

XML Prague 2017 is coming by Jirka Kosek.

From the post:

I’m happy to announce that call for papers for XML Prague 2017 is finally open. We are looking forward for your interesting submissions related to XML. We have switched from CMT to EasyChair for managing submission process – we hope that new system will have less quirks for users then previous one.

We are sorry for slightly delayed start than in past years. But we have to setup new non-profit organization for running the conference and sometimes we felt like characters from Kafka’s Der Process during this process.

We are now working hard on redesigning and opening of registration. Process should be more smooth then in the past.

But these are just implementation details. XML Prague will be again three day gathering of XML geeks, users, vendors, … which we all are used to enjoy each year. I’m looking forward to meet you in Prague in February.

Conference: February 9-11, 2016.

Important Dates:

Important Dates:

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

You can see videos of last year’s presentation (to gauge the competition): Watch videos from XML Prague 2016 on Youtube channel.

December the 15th will be here sooner than you think!

Think of it as a welcome distraction from the barn yard posturing that is U.S. election politics this year!

Why I Distrust US Intelligence Experts, Let Me Count the Ways

Sunday, October 16th, 2016

Some US Intelligence failures, oldest to most recent:

  1. Pearl Harbor
  2. The Bay of Pigs Invasion
  3. Cuban Missile Crisis
  4. Vietnam
  5. Tet Offensive
  6. Yom Kippur War
  7. Iranian Revolution
  8. Soviet Invasion of Afghanistan
  9. Collapse of the Soviet Union
  10. Indian Nuclear Test
  11. 9/11 Attacks
  12. Iraq War (WMDs)
  13. Invasion of Afghanistan (US)
  14. Israeli moles in US intelligence, various dates

Those are just a few of the failures of US intelligence, some of which cost hundreds of thousands if not millions of lives.

Yet, you can read today: Trump’s refusal to accept intelligence briefing on Russia stuns experts.

There are only three reasons I can think of to accept findings by the US intelligence community:

  1. You are on their payroll and for that to continue, well, you know.
  2. As a member of the media, future tips/leaks depends upon your acceptance of current leaks. Anyone who mocks intelligence service lies is cut off from future lies.
  3. As a politician, the intelligence findings discredit facts unfavorable to you.

For completeness sake, I should mention that intelligence “experts” could be telling the truth but given their track record, it is an edge case.

Before repeating the mindless cant of “the Russians are interfering with the US election,” stop to ask your sources, “…based on what?” Opinions of all the members of the US intelligence community = one opinion. Ask for facts. No facts offered, report that instead of the common “opinion.”

Why Journalists Should Not Rely On Wikileaks Indexing – Podesta Emails

Saturday, October 15th, 2016

Clinton on Fracking, or, Another Reason to Avoid Wikileaks Indexing


The quote in the tweet is false.

Politico supplies the correct quotation in its post:

“Bernie Sanders is getting lots of support from the most radical environmentalists because he’s out there every day bashing the Keystone pipeline. And, you know, I’m not into it for that,” Clinton told the unions, according to the transcript. “My view is, I want to defend natural gas. … I want to defend fracking under the right circumstances.”

I’m guessing that “…under the right circumstances.” must have pushed Wikileaks too close to the 140 character barrier.

Ditto for the Wikileaks mis-quote of: “Get a life.”

Which reported as in the tweet, appears to refer to unbridled fracking.

Not so in the Politico post:

“I’m already at odds with the most organized and wildest” of the environmental movement, Clinton told building trades unions in September 2015, according to a transcript of the remarks apparently circulated by her aides. “They come to my rallies and they yell at me and, you know, all the rest of it. They say, ‘Will you promise never to take any fossil fuels out of the earth ever again?’ No. I won’t promise that. Get a life, you know.”

Doesn’t read quite the same way does it?

I supposed once you start lying it’s really hard to stop. Clinton is a good example of that and Wikileaks should not follow her example.

It’s hard to spot these lies because Wikileaks isn’t indexing the attachments.

You can search all day for “defend fracking,” “get a life” (by Clinton) and you will come up empty (at least as of today).

So that you don’t have to search for: 20150909 Transcript | Building Trades Union (Keystone XL) at Wikileaks – Podesta Emails, I have produced a PDF version of that attachment, Building-Trades-Union-Clinton-Sept-09-2015.pdf (my naming), for your viewing pleasure.

Green’s Dictionary of Slang [New Commercializing Information Model?]

Friday, October 14th, 2016

Green’s Dictionary of Slang

From the about page:

Green’s Dictionary of Slang is the largest historical dictionary of English slang. Written by Jonathon Green over 17 years from 1993, it reached the printed page in 2010 in a three-volume set containing nearly 100,000 entries supported by over 400,000 citations from c. ad 1000 to the present day. The main focus of the dictionary is the coverage of over 500 years of slang from c. 1500 onwards.

The printed version of the dictionary received the Dartmouth Medal for outstanding works of reference from the American Library Association in 2012; fellow recipients include the Dictionary of American Regional English, the Oxford Dictionary of National Biography, and the New Grove Dictionary of Music and Musicians. It has been hailed by the American New York Times as ‘the pièce de résistance of English slang studies’ and by the British Sunday Times as ‘a stupendous achievement, in range, meticulous scholarship, and not least entertainment value’.

On this website the dictionary is now available in updated online form for the first time, complete with advanced search tools enabling search by definition and history, and an expanded bibliography of slang sources from the early modern period to the present day. Since the print edition, nearly 60,000 quotations have been added, supporting 5,000 new senses in 2,500 new entries and sub-entries, of which around half are new slang terms from the last five years.

Green’s Dictionary of Slang has an interesting commercial model.

You can search for any word, freely, but “more search features” requires a subscription:

By subscribing to Green’s Dictionary of Slang Online, you gain access to advanced search tools (including the ability to search for words by meaning, history, and usage), full historical citations in each entry, and a bibliography of over 9,000 slang sources.

Current rate for individuals is £ 49 (or about $59.96).

In addition to being a fascinating collection of information, is the free/commercial split here of interest?

An alternative to:

The Teaser Model

Contrast the Oxford Music Online:

Grove Music Online is the eighth edition of Grove’s Dictionary of Music and Musicians, and contains articles commissioned specifically for the site as well as articles from New Grove 2001, Grove Opera, and Grove Jazz. The recently published second editions of The Grove Dictionary of American Music and The Grove Dictionary of Musical Instruments are still being put online, and new articles are added to GMO with each site update.

Oh, Oxford Music Online isn’t all pay-per-view.

It offers the following thirteen (13) articles for free viewing:

Sotiria Bellou, Greek singer of rebetiko song, famous for the special quality and register of her voice

Cell [Mobile] Phone Orchestra, ensemble of performers using programmable mobile (cellular) phones

Crete, largest and most populous of the Greek islands

Lyuba Encheva, Bulgarian pianist and teacher

Gaaw, generic term for drums, and specifically the frame drum, of the Tlingit and Haida peoples of Alaska

Johanna Kinkel, German composer, writer, pianist, music teacher, and conductor

Lady’s Glove Controller, modified glove that can control sound, mechanical devices, and lights

Outsider music, a loosely related set of recordings that do not fit well within any pre-existing generic framework

Peter (Joshua) Sculthorpe, Australian composer, seen by the Australian musical public as the most nationally representative.

Slovenia, country in southern Central Europe

Sound art, a term ecompassing a variety of art forms that utlize sound, or comment on auditory cultures

Alice (Bigelow) Tully, American singer and music philanthropist

Wars in Iraq and Afghanistan, soliders’ relationship with music is largely shaped by contemporary audio technology

Hmmm, 160,000 slang terms for free from Green’s Dictionary of Slang versus 13 free articles from Oxford Music Online.

Show of hands for the teaser model of Oxford Music Online?

The Consumer As Product

You are aware that casual web browsing and alleged “free” sites are not just supported by ads, but by the information they collect on you?

Consider this rather boastful touting of information collection capabilities:

To collect online data, we use our native tracking tags as experience has shown that other methods require a great deal of time, effort and cost on both ends and almost never yield satisfactory coverage or results since they depend on data provided by third parties or compiled by humans (!!), without being able to verify the quality of the information. We have a simple universal server-side tag that works with most tag managers. Collecting offline marketing data is a bit trickier. For TV and radio, we will with your offline advertising agency to collect post-log reports on a weekly basis, transmitted to a secure FTP. Typical parameters include flight and cost, date/time stamp, network, program, creative length, time of spot, GRP, etc.

Convertro is also able to collect other type of offline data, such as in-store sales, phone orders or catalog feeds. Our most popular proprietary solution involves placing a view pixel within a confirmation email. This makes it possible for our customers to tie these users to prior online activity without sharing private user information with us. For some customers, we are able to match almost 100% of offline sales. Other customers that have different conversion data can feed them into our system and match it to online activity by partnering with LiveRamp. These matches usually have a success rate between 30%-50%. Phone orders are tracked by utilizing a smart combination of our in-house approach, the inputting of special codes, or by third party vendors such as Mongoose and ResponseTap.v

You don’t have to be on the web, you can be tracked “in-store,” on the phone, etc.

Converto doesn’t mention explicitly “supercookies,” for which Verizon just paid a $1.35 Million fine. From the post:

“Supercookies,” known officially as unique identifier headers [UIDH], are short-term serial numbers used by corporations to track customer data for advertising purposes. According to Jacob Hoffman-Andrews, a technologist with the Electronic Frontier Foundation, these cookies can be read by any web server one visits used to build individual profiles of internet habits. These cookies are hard to detect, and even harder to get rid of.

If any of that sounds objectionable to you, remember that to be valuable, user habits must be tracked.

That is if you find the idea of being a product acceptable.

The Green’s Dictionary of Slang offers an economic model that enables free access to casual users, kids writing book reports, journalists, etc., while at the same time creating a value-add that power users will pay for.

Other examples of value-add models with free access to the core information?

What would that look like for the Podesta emails?

Becoming a Data Scientist:

Thursday, October 13th, 2016

Becoming a Data Scientist: Advice From My Podcast Guests

Out-gassing from political candidates has kept pushing this summary by Renée Teate back in my queue. Well, fixing that today!

René has created more data science resources than I can easily mention so in addition to this guide, I will mention only two:

Data Science Renee @BecomingDataSci, a Twitter account that will soon break into the rarefied air of > 10,000 followers. Not yet, but you may be the one that puts her over the top!

Looking for women to speak at data science conferences? Renée maintains Women in Data Science, which today has 815 members.

Sorry, three, her blog: Becoming a Data Scientist.

That should keep you busy/distracted until the political noise subsides. 😉

Obama on Fixing Government with Technology (sigh)

Thursday, October 13th, 2016

Obama on Fixing Government with Technology by Caitlin Fairchild.

Like any true technology cultist, President Obama mentions technology, inefficiency, but never the people who make up government as the source of government “problems.” Nor does he appear to realize that technology cannot fix the people who make up government.

Those out-dated information systems he alludes to were built and are maintained under contract with vendors. Systems that are used by users who are accustomed to those systems and will resist changing to others. Still other systems rely upon those systems being as they are in terms of work flow. And so on. At its very core, the problem of government isn’t technology.

It’s the twin requirement that it be composed of and supplied by people, all of who have a vested interest and comfort level with the technology they use and, don’t forget, government has to operate 24/7, 365 days out of the year.

There is no time to take down part of the government to develop new technology, train users in its use and at the same time, run all the current systems which are, to some degree, meeting current requirements.

As an antidote to the technology cultism that infects President Obama and his administration, consider reading Geek Heresy, the description of which reads:

In 2004, Kentaro Toyama, an award-winning computer scientist, moved to India to start a new research group for Microsoft. Its mission: to explore novel technological solutions to the world’s persistent social problems. Together with his team, he invented electronic devices for under-resourced urban schools and developed digital platforms for remote agrarian communities. But after a decade of designing technologies for humanitarian causes, Toyama concluded that no technology, however dazzling, could cause social change on its own.

Technologists and policy-makers love to boast about modern innovation, and in their excitement, they exuberantly tout technology’s boon to society. But what have our gadgets actually accomplished? Over the last four decades, America saw an explosion of new technologies – from the Internet to the iPhone, from Google to Facebook – but in that same period, the rate of poverty stagnated at a stubborn 13%, only to rise in the recent recession. So, a golden age of innovation in the world’s most advanced country did nothing for our most prominent social ill.

Toyama’s warning resounds: Don’t believe the hype! Technology is never the main driver of social progress. Geek Heresy inoculates us against the glib rhetoric of tech utopians by revealing that technology is only an amplifier of human conditions. By telling the moving stories of extraordinary people like Patrick Awuah, a Microsoft millionaire who left his lucrative engineering job to open Ghana’s first liberal arts university, and Tara Sreenivasa, a graduate of a remarkable South Indian school that takes children from dollar-a-day families into the high-tech offices of Goldman Sachs and Mercedes-Benz, Toyama shows that even in a world steeped in technology, social challenges are best met with deeply social solutions.

Government is a social problem and to reach for a technology fix first, is a guarantee of yet another government failure.

IBM’s Program Of Security Via Obscurity (Censorship)

Thursday, October 13th, 2016

Before today, my response to the question: “Does IBM promote security through obscurity?” would have been no!

Today? Full Disclosure @SecLists posted this tweet:


A working version of the URL:

I don’t suppose better software engineering practices and/or rapid repair of IBM’s software occurred to anyone?

George Carlin’s Seven Dirty Words in Podesta Emails – Discovered 981 Unindexed Documents

Thursday, October 13th, 2016

While taking a break from serious crunching of the Podesta emails I discovered 981 unindexed documents at Wikileaks!

Try searching for Carlin’s seven dirty words at The Podesta Emails:

  • shit – 44
  • piss – 19
  • fuck – 13
  • cunt – 0
  • cocksucker – 0
  • motherfucker – 0 (?)
  • tits – 0

I have a ? after “motherfucker” because working with the raw files I show one (1) hit for “motherfucker” and one (1) hit for “motherfucking.” Separate emails.

For “motherfucker,” American Sniper–the movie, responded to by Chris Hedges – To: Podesta@Law.Georgetown.Edu

For “motherfucking,” H4A News Clips 5.31.15 – From/To:

“Motherfucker” and “motherfucking” occur in text attachments to emails, which Wikileaks does not search.

If you do a blank search for file attachments, Wikileaks reports there are 2427 file attachments.

Searching the Podesta emails at Wikileaks excludes the contents of 2427 files from your search results.

How significant is that?

Hmmm, 302 pdf, 501 docx, 167 doc, 12 xls, 9 xlsx – 981 documents excluded from your searches at Wikileaks.

For 9,011 emails, as of AM today, my local.

How comfortable are you with not searching those 981 documents? (Or additional documents that may follow?)

How-To Spot An Armchair Jihadist

Wednesday, October 12th, 2016

To efficiently use law enforcement resources against threats to civil order, the police must recognize the difference between an actual jihadist and an armchair jihadist.

An armchair jihadist is one that talks a good game, dreams of raining fire and death on infidels, etc., but in truth, is the Walter Mitty of terrorism.

Unfortunately, law enforcement disproportionately captures armchair jihadists, for example, the arrest of Samata Ullah, who was charged in part with possession of:

…a book about guided missiles and a PDF version of a book about advanced missile guidance and control for a purpose connected with the commission, preparation or instigation of terrorism”

Admitting the romanticism of building one’s own arsenal, how successful do you think an individual or even a large group of individuals would be at building and testing a guided missile?

Here’s a broad outline of the major steps to building a laser guided missile:

The Manufacturing Process

Constructing the body and attaching the fins

1 The steel or aluminum body is die cast in halves. Die casting involves pouring molten metal into a steel die of the desired shape and letting the metal harden. As it cools, the metal assumes the same shape as the die. At this time, an optional chromium coating can be applied to the interior surfaces of the halves that correspond to a completed missile’s cavity. The halves are then welded together, and nozzles are added at the tail end of the body after it has been welded.

2 Moveable fins are now added at predetermined points along the missile body. The fins can be attached to mechanical joints that are then welded to the outside of the body, or they can be inserted into recesses purposely milled into the body.

Casting the propellant

3 The propellant must be carefully applied to the missile cavity in order to ensure a uniform coating, as any irregularities will result in an unreliable burning rate, which in turn detracts from the performance of the missile. The best means of achieving a uniform coating is to apply the propellant by using centrifugal force. This application, called casting, is done in an industrial centrifuge that is well-shielded and situated in an isolated location as a precaution against fire or explosion.

Assembling the guidance system

4 The principal laser components—the photo detecting sensor and optical filters—are assembled in a series of operations that are separate from the rest of the missile’s construction. Circuits that support the laser system are then soldered onto pre-printed boards; extra attention is given to optical materials at this time to protect them from excessive heat, as this can alter the wavelength of light that the missile will be able to detect. The assembled laser subsystem is now set aside pending final assembly. The circuit boards for the electronics suite are also assembled independently from the rest of the missile. If called for by the design, microchips are added to the boards at this time.

5 The guidance system (laser components plus the electronics suite) can now be integrated by linking the requisite circuit boards and inserting the entire assembly into the missile body through an access panel. The missile’s control surfaces are then linked with the guidance system by a series of relay wires, also entered into the missile body via access panels. The photo detecting sensor and its housing, however, are added at this point only for beam riding missiles, in which case the housing is carefully bolted to the exterior diameter of the missile near its rear, facing backward to interpret the laser signals from the parent aircraft.

Final assembly

6 Insertion of the warhead constitutes the final assembly phase of guided missile construction. Great care must be exercised during this process, as mistakes can lead to catastrophic accidents. Simple fastening techniques such as bolting or riveting serve to attach the warhead without risking safety hazards. For guidance systems that home-in on reflected laser light, the photo detecting sensor (in its housing) is bolted into place at the tip of the warhead. On completion of this final phase of assembly, the manufacturer has successfully constructed on of the most complicated, sophisticated, and potentially dangerous pieces of hardware in use today.

Quality Control

Each important component is subjected to rigorous quality control tests prior to assembly. First, the propellant must pass a test in which examiners ignite a sample of the propellant under conditions simulating the flight of a missile. The next test is a wind tunnel exercise involving a model of the missile body. This test evaluates the air flow around the missile during its flight. Additionally, a few missiles set aside for test purposes are fired to test flight characteristics. Further work involves putting the electronics suite through a series of tests to determine the speed and accuracy with which commands get passed along to the missile’s control surfaces. Then the laser components are tested for reliability, and a test beam is fired to allow examiners to record the photo detecting sensor’s ability to “read” the proper wavelength. Finally, a set number of completed guided missiles are test fired from aircraft or helicopters on ranges studded with practice targets.

Did Samata Ullah have the expertise and/or access to the expertise or manufacturing capability for any of those steps?

Moreover, could Samata Ullah have tested and developed a guided missile without someone noticing?

Possession of first principle reading materials, such as chemistry, rocket, missile, etc., manuals or guides is a clear sign an alleged jihadist is an armchair jihadist.

Another sign of an armchair jihadist, along with the possession of such reading materials, is their failure to obtain explosives, weapons, etc., in an effective way.

The United States, via the CIA and the US military, routinely distributes explosives and weapons around the world to various factions.

A serious jihadist need only travel to well known locations and get in line for explosives, RPGs (rocket-propelled grenades), mortars, etc.

Does the weapon in this photo look homemade?


Of course not! Anyone with a passport and a little imagination can possess a wide variety of harmful devices.

But then, they are not an armchair jihadist.

DIY missile/explosive reading clubs of jihadists are not threats to the public. Manufacturing of explosives and missiles are difficult and dangerous, tasks best left to professionals. They are more dangerous to each other than the general public.

When allocating law enforcement resources, remember that the only thing easier to acquire than weapons is possibly marijuana. Anyone planning on building weapons can be ignored as an armchair jihadist.

In the United States and the United Kingdom, law enforcement resources would be better spent in the pursuit of wealthy and governmental pedophiles.

PS: I started to edit the steps for building a guided missile for length but the description highlights the absurdity of the charges in question. Melting steel or aluminum and pouring it into a metal die? Please, that’s not a backyard activity. Neither is pouring molten rocket fuel using a centrifuge.

British and Irish Legal Information Institute

Tuesday, October 11th, 2016

British and Irish Legal Information Institute

From the webpage:

Welcome to BAILII, where you can find British and Irish case law & legislation, European Union case law, Law Commission reports, and other law-related British and Irish material. BAILII thanks The Scottish Council of Law Reporting for their assistance in establishing the Historic Scottish Law Reports project. BAILII also thanks Sentral for provision of servers. For more information, see About BAILII.

I ran across this wonderful legal resource while researching a legal issue in another post.

Obviously a great resource for legal research and scholars but also I suspect a great source of leisure reading, well, if you like that sort of thing.

The site also offered this handy list of world law resources:

When I said “leisure reading,” I was only partially joking. What we accept now as “the law,” wasn’t always so.

The history of how rights and obligations have evolved over centuries of human interaction are recorded in legislation and case law.

It is a history with all the mis-steps, failures, betrayals and intrigue that are commonplace in any human enterprise.


Parsing Foreign Law From News Reports (Warning For Journalists)

Tuesday, October 11th, 2016

Cory Doctorow‘s headline: Scotland Yard charge: teaching people to use crypto is an act of terrorism red-lined my anti-government biases.

I tend towards “unsound” reactions when free speech is being infringed upon.

But my alarm and perhaps yours as well. was needlessly provoked in this case.

Cory writes:

In other words, according to Scotland Yard, serving a site over HTTPS (as this one is) and teaching people to use crypto (as this site has done) and possessing a secure OS (as I do) are acts of terrorism or potential acts of terrorism. In some of the charges, the police have explicitly connected these charges with planning an act of terrorism, but in at least one of the charges (operating a site served over HTTPS and teaching people about crypto) the charge lacks this addendum — the mere act is considered worthy of terrorism charges.

The concern over:

but in at least one of the charges (operating a site served over HTTPS and teaching people about crypto) the charge lacks this addendum — the mere act is considered worthy of terrorism charges.

is mis-placed.

Cory points to the original report here: Man arrested on Cardiff street to face six terror charges by Viram Dodd.

Cory’s alarm is not repeated by Dodd:

Ullah has been charged with directing terrorism, providing training in encryption programs knowing the purpose was for terrorism, and using his blog site to provide such training. His activities are alleged to have “the intention of assisting another or others to commit acts of terrorism”.

Beyond that (I haven’t seen the charging document), be aware that under English Criminal Procedure, the “charge” on which Cory places so much weight is defined as:


Pay particular attention to 7.3(1)(a)(i) (page 65):

…describes the offense in ordinary language, and…

A “charge” isn’t a technical specification of an offense under English criminal procedure. Which means you attach legal significance to charging language at your own peril. And to the detriment of your readers.

PS: I have contacted the Westminster Magistrates’ Court and requested a copy of the charging document. If and when that arrives, I will update this post with it.

Bias in Data Collection: A UK Example

Monday, October 10th, 2016

Kelly Fiveash‘s story, UK’s chief troll hunter targets doxxing, virtual mobbing, and nasty images starts off:

Trolls who hurl abuse at others online using techniques such as doxxing, baiting, and virtual mobbing could face jail, the UK’s top prosecutor has warned.

New guidelines have been released by the Crown Prosecution Service to help cops in England and Wales determine whether charges—under part 2, section 44 of the 2007 Serious Crime Act—should be brought against people who use social media to encourage others to harass folk online.

It even includes “encouraging” statistics:

According to the most recent publicly available figures—which cite data between May 2013 and December 2014—1,850 people were found guilty in England and Wales of offences under section 127 of the Communications Act 2003. But the numbers reveal a steady climb in charges against trolls. In 2007, there were a total of 498 defendants found guilty under section 127 in England and Wales, compared with 693 in 2008, 873 in 2009, 1,186 in 2010 and 1,286 in 2011.

But the “most recent publicly available figures,” doesn’t ring true does it?

Imagine that, 1850 trolls out of a total population of England and Wales of 57 million. (England 53.9 million, Wales 3.1 million, mid-2013)


Let’s look at the referenced government data, 25015 Table.xls.

For the months of May 2013 to December 2014, there are only monthly totals of convictions.

What data is not being collected?

Among other things:

  1. Offenses reported to law enforcement
  2. Offenses investigated by law enforcement (not the same as #1)
  3. Conduct in question
  4. Relationship, if any, between the alleged offender/victim
  5. Race, economic status, location, social connections of alleged offender/victim
  6. Law enforcement and/or prosecutors involved
  7. Disposition of cases without charges being brought
  8. Disposition of cases after charges brought but before trial
  9. Charges dismissed by courts and acquittals
  10. Judges who try and/or dismiss charges
  11. Penalties imposed upon guilty plea and/or conviction
  12. Appeals and results on appeal, judges, etc.

All that information exists for every reported case of “trolls,” and is recorded at some point in the criminal justice process or could be discerned from those records.

Can you guess who isn’t collecting that information?

The TheyWorkForYou site reports at: Communications Act 2003, Jeremy Wright, The Parliamentary Under-Secretary of State for Justice, saying:

The Ministry of Justice Court Proceedings Database holds information on defendants proceeded against, found guilty and sentenced for criminal offences in England and Wales. This database holds information on offences provided by the statutes under which proceedings are brought but not the specific circumstances of each case. It is not possible to separately identify, in all cases brought under section 127 of the Communications Act 2003, whether a defendant sent or caused to send information to an individual or a small group of individuals or made the information widely available to the public. This detailed information may be held by the courts on individual case files which due to their size and complexity are not reported to Justice Analytical Services. As such this information can be obtained only at disproportionate cost.
… (emphasis added)

I was unaware that courts in England and Wales were still recording their proceedings on vellum. That would be expensive to manually gather that data together. (NOT!)

How difficult is it from any policy organization, whether seeking greater protection from trolls and/or opposing classes of prosecution based on discrimination and free speech to gather the same data?

Here is a map of the Crown Prosecution Service districts:


Counting the sub-offices in each area, I get forty-three separate offices.

But that’s only cases that are considered for prosecution and that’s unlikely to be the same number as reported to the police.

Checking for police districts in England, I get thirty-nine.


Plus, another four areas for Wales:


The Wikipedia article List of law enforcement agencies in the United Kingdom, Crown dependencies and British Overseas Territories has links for all these police areas, which in the interest of space, I did not repeat here.

I wasn’t able to quickly find a map of English criminal courts, although you can locate them by postcode at: Find the right court or tribunal. My suspicion is that Crown Prosecution Service areas correspond to criminal courts. But verify that for yourself.

In order to collect the information already in the possession of the government, you would have to search records in 43 police districts, 43 Crown Prosecution Service offices, plus as many as 43 criminal courts in which defendants may be prosecuted. All over England and Wales. With unhelpful clerks all along the way.

All while the government offers the classic excuse:

As such this information can be obtained only at disproportionate cost.

Disproportionate because:

Abuse of discretion, lax enforcement, favoritism, discrimination by police officers, Crown prosecutors, judges could be demonstrated as statistical facts?

Governments are old hands at not collecting evidence they prefer to not see thrown back in their faces.

For example: FBI director calls lack of data on police shootings ‘ridiculous,’ ‘embarrassing’.

Non-collection of data is a source of bias.

What bias is behind the failure to collect troll data in the UK?

When 24 GB Of Physical RAM Pegs At 98% And Stays There

Sunday, October 9th, 2016

Don’t panic! It has a happy ending but I’m too tired to write it up for posting today.

Tune in tomorrow for lessons learned on FOIA answers that don’t set the information free.

Chasing File Names – Check My Work

Saturday, October 8th, 2016

I encountered a stream of tweets of which the following are typical:


Hmmm, is cf.7z a different set of files from ebd-cf.7z?

You could “eye-ball” the directory listings but that is tedious and error-prone.

Building on what we saw in Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files), let’s combine cf-7z-file-Sorted-Uniq.txt and ebd-cf-file-Sorted-Uniq.txt, and sort that file into cf-7z-and-ebd-cf-files-Sorted.txt.


uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l

(“-d” for duplicate lines) on the resulting file, piping it into wc -l, will give you the result of 2177 duplicates. (The total length of the file is 4354 lines.)


uniq -u cf-7z-and-ebd-cf-files-Sorted.txt

(“-u” for unique lines), will give you no return (no unique lines).

With experience, you will be able to check very large file archives for duplicates. In this particular case, despite the circulating under different names, it appears these two archives contain the same files.

BTW, do you think a similar technique could be applied to spreadsheets?

The “Fact Free” U.S. Intelligence Community (USIC)

Friday, October 7th, 2016

The Joint Statement from the Department of Homeland Security and Office of the Director of National Intelligence on Election Security is a reminder of why the U.S. Intelligence Community (USIC) fails so very often.

The first paragraph:

The U.S. Intelligence Community (USIC) is confident that the Russian Government directed the recent compromises of e-mails from US persons and institutions, including from US political organizations. The recent disclosures of alleged hacked e-mails on sites like and WikiLeaks and by the Guccifer 2.0 online persona are consistent with the methods and motivations of Russian-directed efforts. These thefts and disclosures are intended to interfere with the US election process. Such activity is not new to Moscow—the Russians have used similar tactics and techniques across Europe and Eurasia, for example, to influence public opinion there. We believe, based on the scope and sensitivity of these efforts, that only Russia’s senior-most officials could have authorized these activities.

Do you see any facts in that first paragraph?

I see the conclusion “…are consistent with the methods and motivations of Russian-directed efforts,” but no facts to back that statement up.

Moreover, the second paragraph leaps for the “smoking gun” with:

Some states have also recently seen scanning and probing of their election-related systems, which in most cases originated from servers operated by a Russian company….

You would hope the U.S. Intelligence Community (USIC) would have heard of VPNs (virtual private networks).

No facts, just allegations that favor one party in the fast approaching U.S. presidential election.

Yes, intelligence agencies are interfering with the U.S. election, but its not Russian intelligence agencies.

DNC/DCCC/CF Excel Files, As Of October 7, 2016

Friday, October 7th, 2016

A continuation of my post Avoiding Viruses in DNC/DCCC/CF Excel Files.

Where Avoiding Viruses… focused on avoiding the hazards and dangers of Excel-born viruses, this post focuses on preparing the DNC/DCCC/CF Excel files from Guccifer 2.0, as of October 7, 2016, for further analysis.

As I mentioned before, you could search through all 517 files to date, separately, using Excel. That thought doesn’t bring me any joy. You?

Instead, I’m proposing that we prepare the files to be concatenated together, resulting in one fairly large file, which we can then search and manipulate as one entity.

As a data cleanliness task, I prefer to prefix every line in every csv export, with the name of its original file. That will enable us to extract lines that mention the same person over several files and still have a bread crumb trail back to the original files.

Munging all the files together without such a step, would leave us either grepping across the collection and/or using some other search mechanism. Why not plan on avoiding that hassle?

Given the number of files requiring prefixing, I suggest the following:

for f in *.csv*; do
sed -i "s/^/$f,/" $f

This shell script uses sed with the -i switch, which means sed changes files in place (think overwriting specified part). Here the s/ means to substitute at the ^, start of each line, $f, the filename plus a comma separator and the final $f, is the list of files to be processed.

There are any number of ways to accomplish this task. Your community may use a different approach.

The result of my efforts is: guccifer2.0-all-spreadsheets-07October2016.gz, which weighs in at 61 MB compressed and 231 MB uncompressed.

I did check and despite having variable row lengths, it does load in my oldish version of gnumeric. All 1030828 lines.

That’s not all surprising for gnumeric, considering I’m running 24 GB of physical RAM. Your performance may vary. (It did hesitate loading it.)

There is much left to be done, such as deciding what padding is needed to even out all the rows. (I have ideas, suggestions?)

Tools to manipulate the CSV. I have a couple of stand-bys and a new one that I located while writing this post.

And, of course, once the CSV is cleaned up, what other means can we use to explore the data?

My focus will be on free and high performance (amazing how often those are found together Larry Ellison) tools that can be easily used for exploring vast seas of spreadsheet data.

Next post on these Excel files, Monday, October 10, 2016.

I am downloading the cf.7z Guccifer 2.0 drop as I write this update.

Watch for updates on the comprehensive file list and Excel files next Monday. October 8, 2016, 01:04 UTC.

Avoiding Viruses in DNC/DCCC/CF Excel Files

Friday, October 7th, 2016

I hope you haven’t opened any of the DNC/DCCC/CF Excel files outside of a VM. 517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)


Files from trusted sources can contain viruses. Files from unknown or rogue sources even more so. However tempting (and easy) it is to open up alleged purloined files on your desktop, minimal security conscious users will resist the temptation.

Warning: I did NOT scan the Excel files for viruses. The best way to avoid Excel viruses is to NOT open Excel files.

I used ssconvert, one of the utilities included with gnumeric to bulk convert the Excel files to csv format. (Comma Separate Values is documents in RFC 4780.

Tip: If you are looking for a high performance spreadsheet application, take a look at gnumeric.

Ssconvert relies on file extensions (although other options are available) so I started with:

ssconvert -S donors.xlsx donors.csv

The -S option takes care of workbooks with multiple worksheets. You need a later version of ssconvert (mine is 1.12.9-1 (2013) and the current version of gnumeric and ssconvert is 1.12.31 (August 2016), to convert the .xlsx files without warning.

I’m upgrading to Ubuntu 16.04 soon so it wasn’t worth the trouble trying to stuff a later version of gnumeric/ssconvert onto my present Ubuntu 14.04.

Despite the errors, the conversion appears to have worked properly:


to its csv output:


I don’t see any problems.

I’m checking a sampling of the other conversions as well.

BTW, do notice the confirmation of reports from some commentators that they contacted donors who confirmed donating, but could not recall the amounts.

Could be true. If you pay protection money often enough, I’m sure it’s hard to recall a specific payment.

Sorry, I got distracted.

So, only 516 files to go.

I don’t recommend you do:

ssconvert -S filename.xlsx filename.csv

516 times. That will be tedious and error prone.

At least for Linux, I recommend:

for f in *.xls*; do
   ssconvert -S $f $f.csv

The *.xls* captures both .xsl and .xslx files, then invokes ssconvert -S on the file and then saves the output file with the original name, plus the extension .csv.

The wc -l command reports 1030828 lines in the consolidated csv file for these spreadsheets.

That’s a lot of lines!

I have some suggestions on processing that file, see: DNC/DCCC/CF Excel Files, As Of October 7, 2016.

517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)

Thursday, October 6th, 2016

As of today, the data dumps by Guccifer2.0 have contained 517 Excel files.

The vehemence of posts dismissing this dumps makes me wonder two things:

  1. How many of the Excel files these commentators have reviewed?
  2. What is it that you might find in them that worries them so?

I don’t know the answer to #1 and I won’t speculate on their diligence in examining these files. You can reach your own conclusions in that regard.

Nor can I give you an answer to #2, but I may be able to help you explore these spreadsheets.

The old fashioned way, opening each file, at one Excel file per minute, assuming normal Office performance, ;-), would take longer than an eight-hour day to open them all.

You still must understand and compare the spreadsheets.

To make 517 Excel files more than a number, here’s a list of all the Guccifer2.0 released Excel files as of today: guccifer2.0-excel-files-sorted.txt.

(I do have an unfair advantage in that I am willing to share the files I generate, enabling you to check my statements for yourself. A personal preference for fact-based pleading as opposed to conclusory hand waving.)

If you think of each line in the spreadsheets as a record, this sounds like a record linkage problem. Except they have no uniform number of fields, headers, etc.

With record linkage, we would munge all the records into a single record format and then and only then, match up records to see which ones have data about the same subjects.

Thinking about that, the number 517 looms large because all the formats must be reconciled to one master format, before we start getting useful comparisons.

I think we can do better than that.

First step, let’s consider how to create a master record set that keeps all the data as it exists now in the spreadsheets, but as a single file.

See you tomorrow!

Unmasking Tor users with DNS

Thursday, October 6th, 2016

Unmasking Tor users with DNS by Mark Stockley.

From the post:

Researchers at the KTH Royal Institute of Technology, Stockholm, and Princeton University in the USA have unveiled a new way to attack Tor and deanonymise its users.

The attack, dubbed DefecTor by the researchers’ in their recently published paper The Effect of DNS on Tor’s Anonymity, uses the DNS lookups that accompany our browsing, emailing and chatting to create a new spin on Tor’s most well established weakness; correlation attacks.

If you want the lay-person’s explanation of the DNS issue with Tor, see Mark’s post. If you want the technical details, read The Effect of DNS on Tor’s Anonymity.

The immediate take away for the average user is this:

Donate, volunteer, support the Tor project.

Your privacy or lack thereof is up to you.

Arabic/Russian Language Internet

Thursday, October 6th, 2016

No matter the result of the 2016 US presidential election, mis-information on areas where Arabic and/or Russian are spoken will increase.

If you are creating topic maps and/or want to do useful reporting on such areas consider:

How to get started investigating the Arabic-language internet by Tom Trewinnard, or,

How to get started investigating the Russian-language internet by Aric Toler.

Any hack can quote releases from official sources and leave their readers uninformed.

A journalist takes monotone “facts” from an “official” release and weaves a story of compelling interest to their readers.

Any other guides to language/country specific advice for journalists?

XQuery Snippets on Gist

Thursday, October 6th, 2016

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service

Not a bad way to get in a daily dose of XQuery!

You can also try Stack Overflow:

XQuery (3,000)

xquery-sql (293)

xquery-3.0 (70)

xquery-update (55)


Terrorist HoneyPots?

Thursday, October 6th, 2016

I was reading Checking my honeypot day by Mark Hofman when it occurred to me that discovering CIA/NSA/FBI cybertools may not be as hard as I previously thought.

Imagine creating a <insert-current-popular-terrorist-group-name> website, replete with content ripped off from other terrorist websites, including those sponsored by the U.S. government.

Sharpen your skills at creating fake Twitter followers, AI-generated tweets, etc.

Instead of getting a Booz Allen staffer to betray their employer, you can sit back and collect exploits as they are used.

With just a little imagination, you can create honeypots on and off the Dark Web to attract particular intelligence or law enforcement agencies, security software companies, political hackers and others.

If the FBI can run a porn site, you can use a honeypot to collect offensive cyberweapons.

Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files)

Wednesday, October 5th, 2016

However amusing the headline ‘Guccifer 2.0’ Is Bullshitting Us About His Alleged Clinton Foundation Hack may be, Lorenzo Fanchschi-Bicchierai offers no factual evidence to support his claim,

… the hacker’s latest alleged feat appears to be a complete lie.

Or should I say that:

  • Clinton Foundation denies it has been hacked
  • The Hill whines about who is a donor where
  • The Daily Caller says, “nothing to see here, move along, move along”

hardly qualifies as anything I would rely on.

Checking the file names is one rough check for duplication.

First, you need a set of the file names for all the releases on Guccifer 2.0’s blog:

Relying on file names alone is iffy as the same “content” can be in files with different names, or different content in files with the same name. But this is a rough cut against thousands of documents, so file names it is.

So you can check my work, I saved a copy of the files listed at the blog in date order: guccifer2.0-File-List-By-Blog-Date.txt..

For combining files for use with uniq, you will need a sorted, uniq version of that file: guccifer2.0-File-List-Blog-Sorted-Uniq-lc-final.txt.

Next, there was a major dump of files under the file name 7dc58-ngp-van.7z, approximately 820 MB of files. (Not listed on the blog but from Guccifer 2.0.)

You can use your favorite tool set or grab a copy of: 7dc58-ngp-van-Sorted-Uniq-lc-final.txt.

You need to combine those file names with those from the blog to get a starting set of names for comparison against the alleged Clinton Foundation hack.

Combining those two file name lists together, sorting them and creating a unique list of file names results in: guccifer2.0-30Sept2016-Sorted-Unique.txt.

Follow the same process for ebd-cf.7z, the file that dropped on the 3rd of October 2016. Or grab: ebd-cf-file-Sorted-Uniq-lc-final.txt.

Next, combine guccifer2.0-30Sept2016-Sorted-Unique.txt (the files we knew about before the 3rd of October) with ebd-cf-file-Sorted-Uniq.txt, and sort those file names, resulting in: guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt.

The final step is to apply uniq -d to guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt, which should give you the duplicate files, comparing the files in ebd-cf.7z to those known before September 30, 2016.

The results?

11-26-08 nfc members raised.xlsx

Seven files out of 2085 doesn’t sound like a high degree of duplication.

At least not to me.


PS: On the allegations about the Russians, you could ask the Communists in the State Department or try the Army General Staff. 😉 Some of McCarty’s records are opening up if you need leads.

PPS: Use the final sorted, unique file list to check future releases by Guccifer 2.0. It might help you avoid bullshitting the public.