Phishing, The 43% Option

March 11th, 2018

How’s that for a motivational poster?

You can, and some do, spend hours plumbing the depths of code or chip design for vulnerabilities.

Or, you can look behind door #2, the phishing door, and find 43% of data breaches start with phishing.

Phishing doesn’t have the glamor or prestige of finding a Meltdown or Spectre bug.

But, on the other hand, do you want to breach a congressional email account for the 2018 mid-term election, or for the 2038 election?

Just so you know, no rumors of breached congressional email accounts have surfaced, at least not yet.

Ping me if you see any such news.

PS: The tweet points to an ad for AT&T.

Spreading “Fake News,” Science Says It Wasn’t Russian Bots

March 11th, 2018

The spread of true and false news online by Soroush Vosoughi, Deb Roy, and Sinan Aral. (Science 09 Mar 2018: Vol. 359, Issue 6380, pp. 1146-1151 DOI: 10.1126/science.aap9559)


We investigated the differential diffusion of all of the verified true and false news stories distributed on Twitter from 2006 to 2017. The data comprise ~126,000 stories tweeted by ~3 million people more than 4.5 million times. We classified news as true or false using information from six independent fact-checking organizations that exhibited 95 to 98% agreement on the classifications. Falsehood diffused significantly farther, faster, deeper, and more broadly than the truth in all categories of information, and the effects were more pronounced for false political news than for false news about terrorism, natural disasters, science, urban legends, or financial information. We found that false news was more novel than true news, which suggests that people were more likely to share novel information. Whereas false stories inspired fear, disgust, and surprise in replies, true stories inspired anticipation, sadness, joy, and trust. Contrary to conventional wisdom, robots accelerated the spread of true and false news at the same rate, implying that false news spreads more than the truth because humans, not robots, are more likely to spread it.

Real data science. The team had access to all of the Twitter data, not a cherry-picked selection that, of course, “can’t be shared due to Twitter rules,” as ISIS propaganda scholars like to say.
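If you want to see what “farther, deeper, and more broadly” means in practice, here is a minimal sketch of how cascade size, depth and breadth might be computed from retweet data. The parent/child tweet ids are hypothetical stand-ins, not fields from the paper’s data set.

```python
from collections import defaultdict, deque

def cascade_metrics(edges, root):
    """Size, depth and maximum breadth of a retweet cascade.

    edges: (parent_tweet_id, child_tweet_id) pairs, i.e. who was retweeted by whom.
    root:  the id of the original tweet.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    depth_counts = defaultdict(int)          # depth -> tweets at that depth
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        depth_counts[depth] += 1
        queue.extend((child, depth + 1) for child in children[node])

    return {
        "size": sum(depth_counts.values()),          # total tweets in the cascade
        "depth": max(depth_counts),                  # longest retweet chain
        "max_breadth": max(depth_counts.values()),   # widest single level
    }

# Toy cascade: t0 retweeted by t1 and t2, t1 retweeted by t3.
print(cascade_metrics([("t0", "t1"), ("t0", "t2"), ("t1", "t3")], "t0"))
# {'size': 4, 'depth': 2, 'max_breadth': 2}
```

Vosoughi, Roy and Aral report measures like these, over roughly 126,000 cascades, favoring falsehood across the board.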

The paper merits a slow read but highlights for the impatient:

  1. Don’t invest in bots or high-profile Twitter users for the 2018 mid-term elections.
  2. Craft messages with a high novelty factor that disfavor your candidate’s opponents.
  3. Your messages should inspire fear, disgust and surprise.

Democrats working hard to lose the 2018 mid-terms will cry you a river about issues, true facts, engagement on the issues, and a host of other ideas losers use to explain their losses.

There’s still time to elect a progressive Congress in 2018.

Are you game?

Contesting the Right to Deliver Disinformation

March 8th, 2018

Eric Singerman reports on a recent conference titled: Understanding and Addressing the Disinformation Ecosystem.

He summarizes the conference saying:

The problem of mis- and disinformation is far more complex than the current obsession with Russian troll factories. It’s the product of the platforms that distribute this information, the audiences that consume it, the journalist and fact-checkers that try to correct it – and even the researchers who study it.

In mid-December, First Draft, the Annenberg School of Communication at the University of Pennsylvania and the Knight Foundation brought academics, journalists, fact-checkers, technologists and funders together in a two-day workshop to discuss the challenges produced by the current disinformation ecosystem. The convening was intended to highlight relevant research, share best-practices, identify key questions of scholarly and practical concern and outline a potential research agenda designed to answer these questions.

In preparation for the workshop, a number of attendees prepared short papers that could act as starting points for discussion. These papers covered a broad range of topics – from the ways that we define false and harmful content, to the dystopian future of computer-generated visual disinformation.

Download the papers here.

Singerman points out the very first essay concedes that “fake news” isn’t anything new. Although I would read Schudson and Zelizer (authors of the first paper) with care. They contend:

Fake news lessened in centrality only in the late 1800s as printed news, particularly in Britain and the United States, came to center on what Jean Chalaby called “fact-centered discursive practices” and people realized that newspapers could compete with one another not simply on the basis of partisan affiliation or on the quality of philosophical and political essays but on the immediacy and accuracy of factual reports (Chalaby 1996).

I’m sorry, that’s just factually incorrect. The 1890s were the age of “yellow journalism,” a statement confirmed by the Digital Public Library of America‘s resource collection: Fake News in the 1890s: Yellow Journalism:

Alternative facts, fake news, and post-truth have become common terms in the contemporary news industry. Today, social media platforms allow sensational news to “go viral,” crowdsourced news from ordinary people to compete with professional reporting, and public figures in offices as high as the US presidency to bypass established media outlets when sharing news. However, dramatic reporting in daily news coverage predates the smartphone and tablet by over a century. In the late nineteenth century, the news media war between Joseph Pulitzer’s New York World and William Randolph Hearst’s New York Journal resulted in the rise of yellow journalism, as each newspaper used sensationalism and manipulated facts to increase sales and attract readers.

Many trace the origin of yellow journalism to coverage of the sinking of the USS Maine in Havana Harbor on February 15, 1898, and America’s entry in the Spanish-American War. Both papers’ reporting on this event featured sensational headlines, jaw-dropping images, bold fonts, and aggrandizement of facts, which influenced public opinion and helped incite America’s involvement in what Hearst termed the “Journal’s War.”

The practice, and nomenclature, of yellow journalism actually predates the war, however. It originated with a popular comic strip character known as The Yellow Kid in Hogan’s Alley. Created by Richard F. Outcault in 1895, Hogan’s Alley was published in color by Pulitzer’s New York World. When circulation increased at the New York World, William Randolph Hearst lured Outcault to his newspaper, the New York Journal. Pulitzer fought back by hiring another artist to continue the comic strip in his newspaper.

The period of peak yellow journalism by the two New York papers ended in the late 1890s, and each shifted priorities, but still included investigative exposés, partisan political coverage, and other articles designed to attract readers. Yellow journalism, past and present, conflicts with the principles of journalistic integrity. Today, media consumers will still encounter sensational journalism in print, on television, and online, as media outlets use eye-catching headlines to compete for audiences. To distinguish truth from “fake news,” readers must seek multiple viewpoints, verify sources, and investigate evidence provided by journalists to support their claims.

You can see the evidence relied upon by the DPLA for its claims about yellow journalism here: Fake News in the 1890s: Yellow Journalism.

Why Schudson and Zelizer thought Chalaby, J. “Journalism as an Anglo-American Invention,” European Journal of Communication 11 (3), 1996, 303-326, supported their case isn’t clear.

If you read the Chalaby article, you find it is primarily concerned with contrasting the French press with Anglo-American practices, a comparison in which the French come off a distant second best.

More to the point, neither the New York World, the New York Journal, nor yellow journalism appears anywhere in the Chalaby article. Check for yourself: Journalism as an Anglo-American Invention.

Chalaby does place the origin of “fact-centered discursive practices” in the 1890s, but the absence of any mention of the journalism that led to the Spanish-American War casts doubt on how much we should credit Chalaby’s knowledge of US journalism.

I haven’t checked the other footnotes of Schudson and Zelizer; I leave that as an exercise for interested readers.

I do think Schudson and Zelizer capture the main driver of concern over “fake news” when they say:

First, there is a great anxiety today about the border between professional journalists and others who through digital media have easy access to promoting their ideas, perspectives, factual reports, pranks, inanities, conspiracy theories, fakes and lies….

Despite being framed as a contest between factual reporting and disinformation, the dispute over disinformation/fake news is over the right to profit from disinformation/fake news.

If you need a modern example of yellow journalism, consider the ongoing media frenzy over Russian “interference” in US elections.

How often do you hear reports that include, for context, US-sponsored assassinations, US-funded and US-armed government overthrows, or active US military interference with both elections and governments?

What? Some Russians bought Facebook ads and used election hashtags on Twitter? That compares to overthrowing other governments? See The long history of the U.S. interfering with elections elsewhere (the tip of the iceberg).

The constant hyperbole in the “Russian interference” story is a clue that journalists and social media are re-enacting the roles played by the New York World and the New York Journal, which led to the Spanish-American War.

Truth be told, we should thank social media for the free distribution of disinformation, previously available only by subscription.

Discerning what is or is not accurate information, as always, falls on the shoulders of readers. It has ever been thus.

Confluent: Mapping @apachekafka connect schema types – to usual suspects

March 8th, 2018

Confluent has posted a handy mapping from Kafka Connect schema types to MySQL, Oracle, PostgreSQL, SQLite, SQL Server and Vertica.

The sort of information on which I will waste 10 to 15 minutes every time I need it. Posting it here means I’ll cut the wasted time down to maybe 5 minutes, if I remember I posted about it. 😉
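For the impatient, the flavor of the mapping looks roughly like the sketch below, restricted to MySQL. The pairs are from memory and the exact target types vary by database and connector version, so treat them as illustrative and check Confluent’s table before relying on any of them.

```python
# Rough, from-memory sketch of Kafka Connect schema types mapped to MySQL
# column types; verify each pair against Confluent's table before using it.
CONNECT_TO_MYSQL = {
    "INT8":    "TINYINT",
    "INT16":   "SMALLINT",
    "INT32":   "INT",
    "INT64":   "BIGINT",
    "FLOAT32": "FLOAT",
    "FLOAT64": "DOUBLE",
    "BOOLEAN": "TINYINT",
    "STRING":  "VARCHAR(256)",   # or TEXT for unbounded strings
    "BYTES":   "VARBINARY(1024)",
}

def mysql_column(name, connect_type):
    """Build a column definition for a CREATE TABLE statement."""
    return f"`{name}` {CONNECT_TO_MYSQL[connect_type]}"

print(mysql_column("user_id", "INT64"))   # `user_id` BIGINT
```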

Digital Public Library of America (DPLA) Has New Website!

March 8th, 2018

Announcing the Launch of our New Website (the chest-beating announcement)

From the post:

The Digital Public Library of America (DPLA) is pleased to unveil its all-new redesigned website, now live. Created in collaboration with renowned design firm Postlight, DPLA’s new website is more user-centered than ever before, with a focus on the tools, resources, and information that matter most to DPLA researchers and learners of all kinds. In a shift from the former site structure, content that primarily serves DPLA’s network of partners and others interested in deeper involvement with DPLA can now be found on DPLA Pro.

You can boil the post down to two links: DPLA (DPLA Resources) and DPLA Pro (helping DPLA build and spread resources). What more needs to be said?

Oh, yeah, donate to support the DPLA!

Numba Versus C++ – On Wolfram CAs

March 6th, 2018

Numba Versus C++ by David Butts, Gautham Dharuman, Bill Punch and Michael S. Murillo.

Python is a programming language that first appeared in 1991; soon, it will have its 27th birthday. Python was created not as a fast scientific language, but rather as a general-purpose language. You can use Python as a simple scripting language or as an object-oriented language or as a functional language…and beyond; it is very flexible. Today, it is used across an extremely wide range of disciplines and is used by many companies. As such, it has an enormous number of libraries and conferences that attract thousands of people every year.

But, Python is an interpreted language, so it is very slow. Just how slow? It depends, but you can count on about 10-100 times as slow as, say, C/C++. If you want fast code, the general rule is: don’t use Python. However, a few more moments of thought lead to a more nuanced perspective. What if you spend most of the time coding, and little time actually running the code? Perhaps your familiarity with the (slow) language, or its vast set of libraries, actually saves you time overall? And, what if you learned a few tricks that made your Python code itself a bit faster? Maybe that is enough for your needs? In the end, for true high performance computing applications, you will want to explore fast languages like C++; but, not all of our needs fall into that category.

As another example, consider the fact that many applications use two languages, one for the core code and one for the wrapper code; this allows for a smoother interface between the user and the core code. A common use case is C or C++ wrapped by, of course, Python. As a user, you may not even know that the code you are using is in another language! Such a situation is referred to as the “two-language problem”. This situation is great provided you don’t need to work in the core code, or you don’t mind working in two languages – some people don’t mind, but some do. The question then arises: if you are one of those people who would like to work only in the wrapper language, because it was chosen for its user friendliness, what options are available to make that language (Python in this example) fast enough that it can also be used for the core code?

We wanted to explore these ideas a bit further by writing a code in both Python and C++. Our past experience suggested that while Python is very slow, it could be made about as fast as C using the crazily-simple-to-use library Numba. Our basic comparisons here are: basic Python, Numba and C++. Because we are not religious about Python, and you shouldn’t be either, we invited expert C++ programmers to have the chance to speed up the C++ as much as they could (and, boy could they!).

This webpage is highly annoying, in both Mozilla and Chrome. You’ll have to visit it to get the full impact.

It is, however, also a great post on using Numba to obtain much faster results while still using Python. The use of Wolfram CAs (cellular automata) as examples is an added bonus.
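If you want to try the speed-up yourself before reading the post, here is a minimal sketch in the same spirit: a Rule 30 cellular automaton step written as a plain Python loop, then JIT-compiled with Numba. It is not the authors’ code, just the pattern they benchmark; timings will vary by machine.

```python
import numpy as np
from numba import njit

def step_python(row):
    """One Rule 30 update of a 1-D binary array, written as a plain Python loop."""
    new = np.zeros_like(row)
    n = len(row)
    for i in range(n):
        left, center, right = row[(i - 1) % n], row[i], row[(i + 1) % n]
        new[i] = left ^ (center | right)          # Rule 30
    return new

# The same function, compiled to machine code by Numba on first call.
step_numba = njit(step_python)

row = np.zeros(1001, dtype=np.uint8)
row[500] = 1                                      # single live cell in the middle
for _ in range(500):
    row = step_numba(row)                         # swap in step_python to compare
print(row.sum())
```

On loops like this, Numba closes most of the gap with C, as the authors found; the post quantifies how much further hand-tuned C++ can go.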


An Interactive Timeline of the Most Iconic Infographics

March 1st, 2018

Map of Firsts: An Interactive Timeline of the Most Iconic Infographics by R. J. Andrews.

Careful with this one!

You might learn some history as well as discovering an infographic for your next project!


MSDAT: Microsoft SQL Database Attacking Tool

March 1st, 2018

MSDAT: Microsoft SQL Database Attacking Tool

From the webpage:

MSDAT (Microsoft SQL Database Attacking Tool) is an open source penetration testing tool that tests the security of Microsoft SQL Databases remotely.

Usage examples of MSDAT:

  • You have a Microsoft database listening remotely and you want to find valid credentials in order to connect to the database
  • You have a valid Microsoft SQL account on a database and you want to escalate your privileges
  • You have a valid Microsoft SQL account and you want to execute commands on the operating system hosting this DB (xp_cmdshell)

Tested on Microsoft SQL database 2005, 2008 and 2012.

As I mentioned yesterday, you may have to wait a few years until the Office of Personnel Management (OPM) upgrades to a supported version of Microsoft SQL database, but think of the experience you will have gained with MSDAT by that time.

And by the time the OPM upgrades, new critical security flaws will emerge in Microsoft SQL database 2005, 2008 and 2012. Under current management, the OPM is becoming less and less secure over time.

Would it help if I posted a street/aerial view of OPM headquarters in DC? Would that help focus your efforts at dropping infected USB sticks, malware-loaded DVDs and insecure sex toys for OPM management to find?

OPM headquarters is not marked on the standard tourist map for DC. The map does suggest a number of other fertile places for your wares.

Liberals Amping Right Wing Conspiracies

February 28th, 2018

You read the headline correctly: Liberals Amping Right Wing Conspiracies.

It’s the only reasonable conclusion after reading Molly McKew‘s post: How Liberals Amped up a Paranoid Shooting Conspiracy Theory.

From the post:

This terminology camouflages the war for minds that is underway on social media platforms, the impact that this has on our cognitive capabilities over time, and the extent to which automation is being engaged to gain advantage. The assumption, for example, that other would-be participants in social media information wars who choose to use these same tactics will gain the same capabilities or advantage is not necessarily true. This is a playing field that is hard to level: Amplification networks have data-driven, machine learning components that work better with refinement over time. You can’t just turn one on and expect it to work perfectly.

The vast amounts of content being uploaded every minute cannot possibly be reviewed by human beings. Algorithms, and the poets who sculpt them, are thus given an increasingly outsized role in the shape of our information environment. Human minds are on a battlefield between warring AIs—caught in the crossfire between forces we can’t see, sometimes as collateral damage and sometimes as unwitting participants. In this blackbox algorithmic wonderland, we don’t know if we are picking up a gun or a shield.

McKew has a great description of the amplification in the Parkland shooting conspiracy case, but it’s after the fact and not a basis for predicting the next amplification event.

Any number of research projects suggest themselves:

  • Observing and testing social media algorithms against content
  • Discerning patterns in amplified content
  • Testing refinement of content
  • Building automated tools to apply lessons in amplification

No doubt all those are underway in various guises for any number of reasons. But are you going to share in those results to protect your causes?

Six Degrees of Wikipedia – Eye Candy or Opportunity for Serendipity?

February 28th, 2018

Six Degrees of Wikipedia

As the name implies, Six Degrees of Wikipedia finds the shortest path between two Wikipedia pages. One IT glossary defines serendipity in part as:

In general, serendipity is the act of finding something valuable or delightful when you are not looking for it. In information technology, serendipity often plays a part in the recognition of a new product need or in solving a design problem. Web surfing can be an occasion for serendipity since you sometimes come across a valuable or interesting site when you are looking for something else.

Serendipity requires exposure to things you aren’t looking for. Search engines excel at that, but their results are so noisy that serendipity is a rare occurrence.

Six Degrees of Wikipedia may have a different result.

First and foremost, humans created the links, for reasons unknown, that form the six degrees of separation. The resulting six degrees is a snapshot of human input from dozens, if not hundreds, of human actors, all of whom had unknown motivations.

Second, the limitation to six degrees results in a graph whose nodes can be absorbed at a glance.

Compare to the “I can make big and dense graphs” so typical in the “analysis” of social media results. (Hint: If any US government agency is asking, “The Russians did it.” is the correct response. Gin up supporting data on your own.)

Six degrees between topics would make a fascinating way to explore a topic map, especially one that merged topics from different domains. Randomly select labels to appear alongside those more familiar to a user. Provoke serendipity!
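For the curious, the shortest-path part of this is plain breadth-first search over page links. A minimal sketch, with a made-up toy link graph rather than real Wikipedia data:

```python
from collections import deque

def shortest_path(links, start, goal):
    """Breadth-first search for the shortest chain of links between two pages.

    links: dict mapping a page title to the set of titles it links to.
    Returns the path as a list of titles, or None if no path exists.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == goal:
            return path
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical toy link graph.
links = {
    "Topic map": {"Subject (philosophy)", "XML"},
    "XML": {"Markup language"},
    "Markup language": {"Yellow journalism"},
}
print(shortest_path(links, "Topic map", "Yellow journalism"))
# ['Topic map', 'XML', 'Markup language', 'Yellow journalism']
```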

Covering Human Trafficking … Gulf Arab States (@GIJN)

February 28th, 2018

Guide to Covering Human Trafficking, Forced Labor & Undocumented Migration in Gulf Arab Countries by

From the post:

Over 11 million migrant workers work in the six Middle Eastern countries — Saudi Arabia, Kuwait, the United Arab Emirates, Qatar, Bahrain and Oman — that make up the political and economic alliance known as the Gulf Cooperation Council (GCC). Migrants comprise an extraordinary 67 percent of the labor force in these countries. Reforms in labor laws, adopted by just a few Gulf countries, are rarely implemented.

Abuse of these workers is widespread, with contract violations, dangerous working conditions and unscrupulous traffickers, brokers and employers. Media outlets, both local and international, have generally not covered this topic closely. Journalists attempting to investigate human trafficking and forced labor in the region have faced a lack of information, restrictions on press freedom and security threats. Some have faced detention and deportation.

For these reasons, GIJN, in collaboration with human rights organizations, is launching this first bilingual guide to teach journalists best practices, tools and steps in reporting on human trafficking and forced labor in the Gulf region…

If you are reporting on any aspect of these issues, see also the GIJN’s global Reporting Guide to Human Trafficking & Slavery.

Be aware that residence in a Gulf Arab State isn’t a requirement for reporting on human trafficking.

The top port of entry for human trafficking in the United States is shown on this excerpt of a Google Map:

That’s right, the Hartsfield-Jackson Atlanta International Airport.

Despite knowing their port of entry, Hartsfield-Jackson has yet to make an arrest for human trafficking. (as of May 3, 2017)

Schemes such as Hartsfield-Jackson Wants Travelers to Be the ‘Eyes and Ears’ Detecting Sex Trafficking, may explain their lack of success. Making it everyone’s responsibility means it’s no one’s responsibility.

Improvements aren’t hard to imagine. Separating adults without minors from those traveling with minors would be a first step. Separating minors from their accompanying adults, with native speakers who can speak with the minors privately, plus advertised guarantees of protection in the United States, would be another.

Those who could greatly reduce human trafficking have made a cost/benefit analysis and chosen to allow it to continue, in the Gulf Arab States, the United States and elsewhere.

I’m hopeful you will reach a different conclusion.

Supporting GIJN and your local reporters are ways to assist in combating human trafficking. Data wranglers of all levels and hackers should volunteer their efforts.

Kiddie Hack – OPM

February 27th, 2018

Is it fair to point out the Office of Personnel Management (OPM) continues to fail to plan upgrades to its security?

That’s right: it is not that OPM security upgrades are failing, but that OPM is failing to plan for security upgrades. Three years after 21.5 million current and former federal employee records were stolen from the OPM.

The inspector general report reads in part:

While we believe that the Plan is a step in the right direction toward modernizing OPM’s IT environment, it falls short of the requirements outlined in the Appropriations Act. The Plan identifies several modernization-related initiatives and allocates the $11 million amongst these areas, but the Plan does not identify the full scope of OPM’s modernization effort or contain cost estimates for the individual initiatives or the effort as a whole. All of the other capital budgeting, project planning, and IT security requirements are similarly missing.

At this rate, hackers are stockpiling gear slow enough to work with OPM systems.

Be careful on eBay and other online sources. No doubt the FBI is monitoring purchases of older computer gear.

FastPhotoStyle [Re-writing Dickens]

February 26th, 2018

Start Photo:

Style Photo:

Result Photo (start + style):


There are several other sample transformations at the webpage.

From the webpage:

This code repository contains an implementation of our fast photorealistic style transfer algorithm. Given a content photo and a style photo, the code can transfer the style of the style photo to the content photo. The details of the algorithm behind the code is documented in our arxiv paper. Please cite the paper if this code repository is used in your publications.

Yijun Li (UC Merced), Ming-Yu Liu (NVIDIA), Xueting Li (UC Merced), Ming-Hsuan Yang (NVIDIA, UC Merced), Jan Kautz (NVIDIA), “A Closed-form Solution to Photorealistic Image Stylization,” arXiv preprint arXiv:1802.06474.

Re-writing Dickens:

Marley: Why do you not believe your own eyes?

Scrooge: Software makes them a cheat! A pass of PhotoShop or a round with Gimp, to say nothing of fast photorealistic style transfer algorithms.

Doesn’t have the same ring to it does it?

Forbes Vouches For Public Data Sources

February 26th, 2018

For Forbes readers, a demonstration built on one of the data sets in Bernard Marr’s Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018 (Forbes, Feb. 26, 2018) adds a ring of authenticity to your data. Marr, and by extension Forbes, has vouched for these data sets.

Beats the hell out of opera, medieval boys choirs, or irises for your demonstration. 😉

These data sets show up everywhere, but a reprint from Forbes to leave with your (hopefully) future client sets your data set apart from the others.

Tip: As interesting as it is, I’d skip the CERN Open Data unless you are presenting to physicists. Yes? Hint: Pick something relevant to your audience.

Guide to Searching CIA’s Declassified Archives

February 26th, 2018

The ultimate guide to searching CIA’s declassified archives: Looking to dig into the Agency’s 70 year history? Here’s where to start, by Emma Best.

From the webpage:

While the Agency deserves credit for compiling a basic guide to searching their FOIA reading room, it still omits information or leaves it spread out across the Agency’s website. In one egregious example, the CIA guide to searching the records lists only three content types that users can search for, a review of the metadata compiled by Data.World reveals an addition ninety content types. This guide will tell you everything you need to know to dive into CREST and start searching like a pro.

Great guide for anyone interested in the declassified CIA archives.


#7 Believing that information leads to action (Myth of Liberals)

February 26th, 2018

Top 10 Mistakes in Behavior Change

Slides from Stanford University’s Persuasive Tech Lab.

A great resource whether you are promoting a product, service or trying to “interfere” with an already purchased election.

I have a special fondness for mistake #7 on the slides:

Believing that information leads to action

If you want to lose the 2018 mid-terms, or even worse, the presidential election in 2020, keep believing in “educating” voters.

Ping me if you want to be a winning liberal.

Governments Are Secure, But Only By Your Forbearance (happens-before (HB) graphs)

February 26th, 2018

MeltdownPrime and SpectrePrime: Automatically-Synthesized Attacks Exploiting Invalidation-Based Coherence Protocols by Caroline Trippel, Daniel Lustig, Margaret Martonosi.


The recent Meltdown and Spectre attacks highlight the importance of automated verification techniques for identifying hardware security vulnerabilities. We have developed a tool for synthesizing microarchitecture-specific programs capable of producing any user-specified hardware execution pattern of interest. Our tool takes two inputs: a formal description of (i) a microarchitecture in a domain-specific language, and (ii) a microarchitectural execution pattern of interest, e.g. a threat pattern. All programs synthesized by our tool are capable of producing the specified execution pattern on the supplied microarchitecture.

We used our tool to specify a hardware execution pattern common to Flush+Reload attacks and automatically synthesized security litmus tests representative of those that have been publicly disclosed for conducting Meltdown and Spectre attacks. We also formulated a Prime+Probe threat pattern, enabling our tool to synthesize a new variant of each—MeltdownPrime and SpectrePrime. Both of these new exploits use Prime+Probe approaches to conduct the timing attack. They are both also novel in that they are 2-core attacks which leverage the cache line invalidation mechanism in modern cache coherence protocols. These are the first proposed Prime+Probe variants of Meltdown and Spectre. But more importantly, both Prime attacks exploit invalidation-based coherence protocols to achieve the same level of precision as a Flush+Reload attack. While mitigation techniques in software (e.g., barriers that prevent speculation) will likely be the same for our Prime variants as for original Spectre and Meltdown, we believe that hardware protection against them will be distinct. As a proof of concept, we implemented SpectrePrime as a C program and ran it on an Intel x86 processor, averaging about the same accuracy as Spectre over 100 runs—97.9% for Spectre and 99.95% for SpectrePrime.

A separate paper on the “tool” used in this article is under review, so more joy is on the way!

As a bonus, “happens-before (HB) graphs” are used, enabling exercise of those graph skills you built making cluttered Twitter graphs.
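If happens-before graphs are unfamiliar, here is a minimal sketch of the idea, using networkx and made-up event names rather than anything from the paper: events are nodes, ordering constraints are edges, and “must happen before” is just reachability.

```python
import networkx as nx

# Hypothetical ordering constraints among memory events on two cores.
hb = nx.DiGraph()
hb.add_edges_from([
    ("core0: write X", "core0: invalidate X"),
    ("core0: invalidate X", "core1: load X"),
    ("core1: load X", "core1: timing probe"),
])

def happens_before(graph, a, b):
    """A happens-before B iff B is reachable from A in the HB graph."""
    return nx.has_path(graph, a, b)

print(happens_before(hb, "core0: write X", "core1: timing probe"))  # True
print(happens_before(hb, "core1: load X", "core0: write X"))        # False

# A cycle in a candidate HB graph would mean the execution is impossible.
print(nx.is_directed_acyclic_graph(hb))                             # True
```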

Good hunting!

Learning Drawing Skills To Help You Communicate

February 22nd, 2018

I sigh with despair every time I see yet another drawing by Julia Evans.

All of it is clever, clear and, without effort on my part, beyond me.

Yeah, it’s the “without effort on my part” that keeps me from learning basic drawing skills.

You’re never going to say of a drawing by me, “There’s a proper Julia Evans!” but I don’t think basic drawing skills are beyond me, provided I take the time to practice.

How expensive are guidebooks? Does free sound OK?

By E.G. Lutz, What to Draw and How to Draw It (1913), Drawing Made Easy (1935).

BTW, Lutz inspired Walt Disney with: Animated Cartoons: How They Are Made, Their Origin and Development.

I found this at The Public Domain Review. Support for them is always a good idea.

Of course I would rather be exploring nuances of XQuery, but that’s because XQuery is already familiar.

It’s trying the unfamiliar that leads to new skills, hopefully. 😉

Comparing Comprehensive English Grammars?

February 22nd, 2018

Neal Goldfarb, in SCOTUS cites CGEL (Props to Justice Gorsuch and the Supreme Court library), highlights two comprehensive grammars of English.

Both are known by the initials CGEL:

Being the more recent work, the Cambridge Grammar of the English Language lists today for $279.30 (1860 pages), whereas Quirk’s 1985 Comprehensive Grammar of the English Language can be had for $166.08 (1779 pages).

Interesting fact: the acronym CGEL had been in use for 17 years by the Comprehensive Grammar of the English Language before the Cambridge Grammar of the English Language was published under the same acronym.

Curious how much new information was added by the Cambridge grammar? If you had a machine-readable text of both, excluded the examples and then calculated the semantic distance between sections covering the same material, you could produce a measurement of the distance between the two texts.
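As a minimal sketch of what that measurement might look like: TF-IDF plus cosine similarity over two matched sections, with placeholder sentences standing in for the actual grammars. This is a crude lexical proxy for “semantic distance”; doing it properly would call for embeddings or a trained similarity model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder stand-ins for matched sections of the two grammars,
# examples already stripped out.
quirk_section = "The passive is formed with the auxiliary be and the -ed participle."
cambridge_section = "Passive clauses combine a form of be with a past participle."

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([quirk_section, cambridge_section])

similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity: {similarity:.3f}")   # 1.0 means identical wording
print(f"lexical distance:  {1 - similarity:.3f}")
```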

Given the prices of academic texts, standardizing a method of comparison would be a boon to scholars and graduate students!

(No comment on the over-writing of the acronym for Quirk’s work by Cambridge.)

Deep Voice – The Empire Grows Steadily Less Secure

February 22nd, 2018

Baidu AI Can Clone Your Voice in Seconds

From the post:

Baidu’s research arm announced yesterday that its 2017 text-to-speech (TTS) system Deep Voice has learned how to imitate a person’s voice using a mere three seconds of voice sample data.

The technique, known as voice cloning, could be used to personalize virtual assistants such as Apple’s Siri, Google Assistant, Amazon Alexa; and Baidu’s Mandarin virtual assistant platform DuerOS, which supports 50 million devices in China with human-machine conversational interfaces.

In healthcare, voice cloning has helped patients who lost their voices by building a duplicate. Voice cloning may even find traction in the entertainment industry and in social media as a tool for satirists.

Baidu researchers implemented two approaches: speaker adaption and speaker encoding. Both deliver good performance with minimal audio input data, and can be integrated into a multi-speaker generative model in the Deep Voice system with speaker embeddings without degrading quality.

See the post for links to three-second voice clips and other details.


The recent breakthroughs in synthesizing human voices have also raised concerns. AI could potentially downgrade voice identity in real life or with security systems. For example voice technology could be used maliciously against a public figure by creating false statements in their voice. A BBC reporter’s test with his twin brother also demonstrated the capacity for voice mimicking to fool voiceprint security systems.

That’s a concern? 😉

I think cloned voices of battlefield military commanders, cloned politician voices with sex partners, or “known” voices badgering help desk staff into giving up utility plant or other access, those are “concerns.” Or “encouragements,” depending on your interests in such systems.

If You Like “Fake News,” You Will Love “Fake Science”

February 22nd, 2018

Prestigious Science Journals Struggle to Reach Even Average Reliability by Björn Brembs.


In which journal a scientist publishes is considered one of the most crucial factors determining their career. The underlying common assumption is that only the best scientists manage to publish in a highly selective tier of the most prestigious journals. However, data from several lines of evidence suggest that the methodological quality of scientific experiments does not increase with increasing rank of the journal. On the contrary, an accumulating body of evidence suggests the inverse: methodological quality and, consequently, reliability of published research works in several fields may be decreasing with increasing journal rank. The data supporting these conclusions circumvent confounding factors such as increased readership and scrutiny for these journals, focusing instead on quantifiable indicators of methodological soundness in the published literature, relying on, in part, semi-automated data extraction from often thousands of publications at a time. With the accumulating evidence over the last decade grew the realization that the very existence of scholarly journals, due to their inherent hierarchy, constitutes one of the major threats to publicly funded science: hiring, promoting and funding scientists who publish unreliable science eventually erodes public trust in science.

Facts, even “scientific facts,” should be questioned, tested and never blindly accepted.

Knowing a report appears in Nature, or Science, or (zine of your choice), helps you find it. Beyond that, you have to read and evaluate the publication to credit it with more than a place of publication.

Read beyond abstracts or click-bait headlines, check footnotes or procedures; do those things often enough and you will be in danger of becoming a critical reader. Careful!

Self-Inflicted Insecurity in the Cloud – Selling Legal Firm Data

February 21st, 2018

The self-inflicted insecurity phrase being “…behind your own firewall….”

You can see the rest of the Oracle huffing and puffing here.

The odds of breaching law firm security are increased by:

  • Changing to an unfamiliar computing environment (the cloud), or
  • Changing to unfamiliar security software (cloud firewalls).

Either one alone is sufficient; together, security-breaching errors are nearly certain.

Even with an increase in vulnerability, hackers still face the question of how to monetize law firm data.

The economics and markets for stolen credit card and personal data are fairly well known. See The Underground Economy of Data Breaches by Wade Williamson and Once Stolen, What Do Hackers Do With Your Data?

Dumping law firm data, such as the Panama Papers, generates a lot of PR but doesn’t add anything to your bank account.

Extracting value from law firm data is a variation on e-discovery, a non-trivial process briefly described in The Basics of E-Discovery.

However embarrassing law firm data may be, to its former possessors or their clients, market mechanisms akin to those for credit/personal data have yet to develop.

Pointers to the contrary?

The EFF, Privilege, Revolution

February 20th, 2018

The Revolution and Slack by Gennie Gebhart and Cindy Cohn.

From the post:

The revolution will not be televised, but it may be hosted on Slack. Community groups, activists, and workers in the United States are increasingly gravitating toward the popular collaboration tool to communicate and coordinate efforts. But many of the people using Slack for political organizing and activism are not fully aware of the ways Slack falls short in serving their security needs. Slack has yet to support this community in its default settings or in its ongoing design.

We urge Slack to recognize the community organizers and activists using its platform and take more steps to protect them. In the meantime, this post provides context and things to consider when choosing a platform for political organizing, as well as some tips about how to set Slack up to best protect your community.

Great security advice for organizers and activists who choose to use Slack.

But let’s be realistic about “revolution.” The EFF, and the community organizers and activists who would use Slack, are by definition not revolutionaries.

How else would you explain the pantheon of legal cases pursued by the EFF? When the EFF lost, did it seek remedies by other means? Did it take illegal action to protect/avenge injured innocents?

Privilege is what enables people to say, “I’m using the law to oppose to X,” while other people are suffering the consequences of X.

Privilege holders != revolutionaries.

FYI any potential revolutionaries: If “on the Internet, no one knows you’re a dog,” it’s also true that “no one knows you are a government agent.”

Evidence for Power Laws – “…I work scientifically!”

February 17th, 2018

Scant Evidence of Power Laws Found in Real-World Networks by Erica Klarreich.

From the post:

A paper posted online last month has reignited a debate about one of the oldest, most startling claims in the modern era of network science: the proposition that most complex networks in the real world — from the World Wide Web to interacting proteins in a cell — are “scale-free.” Roughly speaking, that means that a few of their nodes should have many more connections than others, following a mathematical formula called a power law, so that there’s no one scale that characterizes the network.

Purely random networks do not obey power laws, so when the early proponents of the scale-free paradigm started seeing power laws in real-world networks in the late 1990s, they viewed them as evidence of a universal organizing principle underlying the formation of these diverse networks. The architecture of scale-freeness, researchers argued, could provide insight into fundamental questions such as how likely a virus is to cause an epidemic, or how easily hackers can disable a network.

An informative and highly entertaining read that reminds me of an exchange in The Never Ending Story between Atreyu and Engywook.

Engywook’s “scientific specie-ality” is the Southern Oracle. From the transcript:

Atreyu: Have you ever been to the Southern Oracle?

Engywook: Eh… what do YOU think? I work scientifically!

In the context of the movie, Engywook’s answer is deeply ambiguous.

Where do you land on the power law question?
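If you want to poke at the question yourself, a quick sketch compares the degree distribution of a random graph with that of a preferential-attachment graph (networkx and matplotlib assumed installed). A straight line on the log-log plot is the visual signature a power law would leave, though eyeballing plots is exactly the practice the paper criticizes; rigorous answers need maximum-likelihood fitting.

```python
import collections
import networkx as nx
import matplotlib.pyplot as plt

n = 10_000
random_g = nx.gnm_random_graph(n, 5 * n, seed=1)        # Erdos-Renyi style
pa_g = nx.barabasi_albert_graph(n, 5, seed=1)           # preferential attachment

def degree_distribution(g):
    """Return (degrees, counts) for plotting the degree distribution of g."""
    counts = collections.Counter(d for _, d in g.degree())
    degrees = sorted(counts)
    return degrees, [counts[d] for d in degrees]

for g, label in [(random_g, "random"), (pa_g, "preferential attachment")]:
    xs, ys = degree_distribution(g)
    plt.loglog(xs, ys, marker=".", linestyle="none", label=label)

plt.xlabel("degree")
plt.ylabel("number of nodes")
plt.legend()
plt.show()
```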

Working with The New York Times API in R

February 17th, 2018

Working with The New York Times API in R by Jonathan D. Fitzgerald.

From the post:

Have you ever come across a resource that you didn’t know existed, but once you find it you wonder how you ever got along without it? I had this feeling earlier this week when I came across the New York Times API. That’s right, the paper of record allows you–with a little bit of programming skills–to query their entire archive and work with the data. Well, it’s important to note that we don’t get the full text of articles, but we do get a lot of metadata and URLs for each of the articles, which means it’s not impossible to get the full text. But still, this is pretty cool.

So, let’s get started! You’re going to want to head over to the NYT developer site to get an API Key. While you’re there, check out the selection of APIs on offer–there are over 10, including Article Search, Archive, Books, Comments, Movie Reviews, Top Stories, and more. I’m still digging into each of these myself, so today we’ll focus on Article Search, and I suspect I’ll revisit the NYT API in this space many times going forward. Also at NYT’s developer site, you can use their API Tool feature to try out some queries without writing code. I found this helpful for wrapping my head around the APIs.

A great “getting your feet wet” introduction to the New York Times API in R.
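The post itself works in R, but the shape of the request is the same in any language. A minimal Python sketch of the Article Search call, with the endpoint and parameter names as I remember them from the NYT docs (check developer.nytimes.com before trusting them):

```python
import requests

API_KEY = "your-key-here"   # issued at the NYT developer site
URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

params = {
    "q": "yellow journalism",
    "begin_date": "18980101",   # YYYYMMDD
    "end_date": "18981231",
    "api-key": API_KEY,
}

response = requests.get(URL, params=params)
response.raise_for_status()
for doc in response.json()["response"]["docs"]:
    print(doc["pub_date"], doc["headline"]["main"], doc["web_url"])
```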

Caution: The line between the New York Times (NYT) and governments is a blurry one. It has cooperated with governments in the past and will do so in the future. If you are betrayed by the NYT, you have no one but yourself to blame.

The same is true for the content of the NYT, past or present. Chance is not the deciding factor on stories being reported in the NYT. It won’t be possible to discern motives in the vast majority of cases but that doesn’t mean they didn’t exist. Treat the “historical” record as carefully as current accounts based on “reliable sources.”

Distributed Systems Seminar [Accounting For Hostile Environments]

February 17th, 2018

Distributed Systems Seminar by Peter Alvaro.

From the webpage:


This graduate seminar will explore distributed systems research, both current and historical, with a particular focus on storage systems and programming models.

Due to fundamental uncertainty in their executions arising from asynchronous communication and partial failure, distributed systems present unique challenges to programmers and users. Moreover, distributed systems are increasingly ubiquitous: nearly all non-trivial systems are now physically distributed. It is no longer possible to relegate responsibility for managing the complexity of distributed systems to a group of expert library or infrastructure writers: all programmers must now be distributed programmers. This is both a crisis and an opportunity.

A great deal of theoretical work in distributed systems establishes important impossibility results, including the famous FLP result, the CAP Theorem, the two generals problem and the impossibility of establishing common knowledge via protocol. These results tell us what we cannot achieve in a distributed system, or more constructively, they tell us about the properties we must trade off for the properties we require when designing or using large-scale systems. But what can we achieve? The history of applied distributed systems work is largely the history of infrastructures — storage systems as well as programming models — that attempt to manage the fundamental complexity of the domain with a variety of abstractions.

This course focuses on these systems, models and languages. We will cover the following topics:

  • Consistency models
  • Large-scale storage systems and data processing frameworks
  • Commit, consensus and synchronization protocols
  • Data replication and partitioning
  • Fault-tolerant design
  • Programming models
  • Distributed programming languages and program analysis
  • Seminal theoretical results in distributed systems


This course is a research seminar: we will focus primarily on reading and discussing conference papers. We will read 1-2 papers (typically 2) per session; for each paper, you will provide a brief summary (about 1 page). The summary should answer some or all of the following questions:

  • What problem does the paper solve? Is it important?
  • How does it solve the problem?
  • What alternative approaches are there? Are they adequately discussed in the reading?
  • How does this work relate to other research, whether covered in this course or not?
  • What specific research questions, if any, does the paper raise for you?

What a great list of readings!

An additional question for each paper: Does It Account For Hostile Environments?

As Alvaro says: “…nearly all non-trivial systems are now physically distributed.”

That’s a rather large attack surface to leave for unknown others, by unknown means, to secure to an unknown degree, on your behalf.

If you make that choice, add “cyber-victim” to your business cards.

If you aren’t already, you will be soon enough.

@GalaxyKate, Generators, Steganographic Fields Forever (+ Secure Message Tip)

February 16th, 2018

Before you skip this post as just being about “pretty images,” know that generators span everything from grammars to constraint solvers. Artistry for sure, but exploration can lead to hard-core CS rather quickly.

I stumbled upon @GalaxyKate’s Generative Art & Procedural Content Starter Kit:

Practical Procedural Generation for Everyone: Thirty or so minutes on YouTube, 86,133 views when I checked the link.

So you want to build a generator: In depth blog post with lots of content and links.

Encyclopedia of Generativity: As far as I can tell, a one issue zine by @GalaxyKate but it will take months to explore.

One resource I found while chasing these links was: Procedural Generation.

Oh, and you owe it to yourself to visit GalaxyKate’s homepage:

The small scale of my blog presentation makes that screenshot a pale imitation of what you will find. Great resource!

There’s no shortage of visual content on the Web; one estimate says that in 2017, 74% of all internet traffic was video.

Still, if you practice steganographic concealment of information, you should make the work of the hounds as difficult as possible. Generators are an obvious way of working towards that goal.
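As a toy illustration of the concealment side, here is least-significant-bit embedding with Pillow. It is only a sketch: LSB embedding in a PNG is the first thing any steganalysis tool checks for, which is exactly why less predictable carriers, such as procedurally generated imagery, are attractive.

```python
from PIL import Image

def embed(carrier_path, message, out_path):
    """Hide a UTF-8 message in the low bit of each pixel's red channel."""
    img = Image.open(carrier_path).convert("RGB")
    pixels = list(img.getdata())
    bits = []
    for byte in message.encode("utf-8") + b"\x00":      # NUL marks end of message
        bits.extend((byte >> i) & 1 for i in range(8))
    if len(bits) > len(pixels):
        raise ValueError("message too long for this carrier")
    for i, bit in enumerate(bits):
        r, g, b = pixels[i]
        pixels[i] = ((r & ~1) | bit, g, b)              # overwrite red channel's low bit
    img.putdata(pixels)
    img.save(out_path, "PNG")                           # lossless, or the bits are lost

def extract(stego_path):
    """Recover a message hidden by embed()."""
    bits = [r & 1 for r, _, _ in Image.open(stego_path).convert("RGB").getdata()]
    out = bytearray()
    for i in range(0, len(bits) - 7, 8):
        byte = sum(bits[i + j] << j for j in range(8))
        if byte == 0:
            break
        out.append(byte)
    return out.decode("utf-8")
```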

One secure message tip: Other than for propaganda, which you want discovered and read, omit any greetings, closings, or other rote content, such as blessings, religious quotes, etc.

The famous German Enigma was broken in part because messages shared the same opening text, routine content, and closing text (Heil Hitler!), and because the same message was sometimes sent in different encodings. Exploring the Enigma

Or in other words, Don’t repeat famous cryptographic mistakes!

Krita (open source painting program)

February 15th, 2018


Do you know Krita? Not being artistically inclined, I don’t often encounter digital art tools. Judging from the examples though:

I’m missing some great imagery, even if I can’t create the same.

Great graphics can enhance your interfaces, education apps, games, propaganda, etc.

Don’t Delete Evil Data [But Remember the Downside of “Evidence”]

February 14th, 2018

Don’t Delete Evil Data by Lam Thuy Vo.

From the post:

The web needs to be a friendlier place. It needs to be more truthful, less fake. It definitely needs to be less hateful. Most people agree with these notions.

There have been a number of efforts recently to enforce this idea: the Facebook groups and pages operated by Russian actors during the 2016 election have been deleted. None of the Twitter accounts listed in connection to the investigation of the Russian interference with the last presidential election are online anymore. Reddit announced late last fall that it was banning Nazi, white supremacist, and other hate groups.

But even though much harm has been done on these platforms, is the right course of action to erase all these interactions without a trace? So much of what constitutes our information universe is captured online—if foreign actors are manipulating political information we receive and if trolls turn our online existence into hell, there is a case to be made for us to be able to trace back malicious information to its source, rather than simply removing it from public view.

In other words, there is a case to be made to preserve some of this information, to archive it, structure it, and make it accessible to the public. It’s unreasonable to expect social media companies to sidestep consumer privacy protections and to release data attached to online misconduct willy-nilly. But to stop abuse, we need to understand it. We should consider archiving malicious content and related data in responsible ways that allow for researchers, sociologists, and journalists to understand its mechanisms better and, potentially, to demand more accountability from trolls whose actions may forever be deleted without a trace.

By some unspecified mechanism, I would support preservation of all social media, as well as having it publicly available if it were publicly posted originally. Any restriction or permission to see/use the data will lead to the same abuses we see now.

Twitter, among others, talks about abuse but no one can prove or disprove whatever Twitter cares to say.

There is a downside to preserving social media. You have probably seen the NBC News story on 200,000 tweets that are the smoking gun on Russian interference with the 2016 elections.

Well, except that if you look at the tweets, that’s about as far from a smoking gun on Russian interference as anything you can imagine.

By analogy, that’s why intelligence analysts always say they have evidence and give you their conclusions, but not the evidence. Too much danger you will discover their report is completely fictional.

Or, when not wholly fictional, that it serves their or their agency’s interests.

Keeping evidence is risky business. Just so you are aware.

Wikileaks Has Sprung A Leak

February 14th, 2018

In Leaked Chats, WikiLeaks Discusses Preference for GOP over Clinton, Russia, Trolling, and Feminists They Don’t Like by Micah Lee, Cora Currier.

From the post:

On a Thursday afternoon in November 2015, a light snow was falling outside the windows of the Ecuadorian embassy in London, despite the relatively warm weather, and Julian Assange was inside, sitting at his computer and pondering the upcoming 2016 presidential election in the United States.

In little more than a year, WikiLeaks would be engulfed in a scandal over how it came to publish internal emails that damaged Hillary Clinton’s presidential campaign, and the extent to which it worked with Russian hackers or Donald Trump’s campaign to do so. But in the fall of 2015, Trump was polling at less than 30 percent among Republican voters, neck-and-neck with neurosurgeon Ben Carson, and Assange spoke freely about why WikiLeaks wanted Clinton and the Democrats to lose the election.

“We believe it would be much better for GOP to win,” he typed into a private Twitter direct message group to an assortment of WikiLeaks’ most loyal supporters on Twitter. “Dems+Media+liberals woudl then form a block to reign in their worst qualities,” he wrote. “With Hillary in charge, GOP will be pushing for her worst qualities., dems+media+neoliberals will be mute.” He paused for two minutes before adding, “She’s a bright, well connected, sadistic sociopath.”

Like Wikileaks, the Intercept treats the public like rude children, publishing only what it considers to be newsworthy content:

The archive spans from May 2015 through November 2017 and includes over 11,000 messages, more than 10 percent of them written from the WikiLeaks account. With this article, The Intercept is publishing newsworthy excerpts from the leaked messages.

My criticism of the Intercept’s selective publication of leaks isn’t unique to this story about Wikileaks. I have voiced similar concerns about the ICIJ and Wikileaks itself.

I want to believe the Intercept, ICIJ and Wikileaks when they proclaim others have been lying, unfaithful, dishonest, etc.

But that wanting/desire makes it even more important that I critically assess the evidence they advance for their claims.

Selective release of evidence undermines their credibility, leaving them no more credible than those they accuse.

BTW, if anyone has a journalism 101 guide to writing headlines, send a copy to the Intercept. They need it.

PS: I don’t have an opinion one way or the other on the substance of the Lee/Currier account. I’ve never been threatened with a government missile so can’t say how I would react. Badly I would assume.