Archive for December, 2017

Twitter Taking Sides – Censorship-Wise

Thursday, December 28th, 2017

@wikileaks pointed out that Twitter’s censorship policies are taking sides:

Accounts that affiliate with organizations that use or promote violence against civilians to further their causes. Groups included in this policy will be those that identify as such or engage in activity — both on and off the platform — that promotes violence. This policy does not apply to military or government entities and we will consider exceptions for groups that are currently engaging in (or have engaged in) peaceful resolution.
… (emphasis added)

Does Twitter need a new logo? Birds with government insignia dropping bombs on civilians?

The Coolest Hacks of 2017 [Inspirational Reading for 2018]

Wednesday, December 27th, 2017

The Coolest Hacks of 2017 by Kelly Jackson Higgins.

From the post:

You’d think by now with the pervasiveness of inherently insecure Internet of Things things that creative hacking would be a thing of the past for security researchers. It’s gotten too easy to find security holes and ways to abuse IoT devices; they’re such easy marks.

But our annual look at the coolest hacks we covered this year on Dark Reading shows that, alas, innovation is not dead. Security researchers found intriguing and scary security flaws that can be abused to bend the will of everything from robots to voting machines, and even the wind. They weaponized seemingly benign systems such as back-end servers and machine learning tools in 2017, exposing a potential dark side to these systems.

So grab a cold one from your WiFi-connected smart fridge and take a look at seven of the coolest hacks of the year.

“Dark side” language brings a sense of intrigue and naughtiness. But the “dark side(s)” of any system is just a side that meets different requirements. Such as access without authorization. May not be your requirement but it may be mine, or your government’s.

Let’s drop the dodging and posing as though there is a common interest in cybersecurity. There is no such common interest, nor has there ever been one. Governments want backdoors; privacy advocates, black marketeers, and spies want none. Users want effortless security, while security experts know security ads are just short of actionable fraud.

Cybersecurity marketeers may resist, but detail your specific requirements, in writing and appended to your contract.

Streaming SQL for Apache Kafka

Wednesday, December 27th, 2017

Streaming SQL for Apache Kafka by Hojjat Jafarpour.

From the post:

We are very excited to announce the December release of KSQL, the streaming SQL engine for Apache Kafka! As we announced in the November release blog, we are releasing KSQL on a monthly basis to make it even easier for you to get up and running with the latest and greatest functionality of KSQL to solve your own business problems.

The December release, KSQL 0.3, includes both new features that have been requested by our community as well as under-the-hood improvements for better robustness and resource utilization. If you have already been using KSQL, we encourage you to upgrade to this latest version to take advantage of the new functionality and improvements.

The KSQL Github page links to:

  • KSQL Quick Start: Demonstrates a simple workflow using KSQL to write streaming queries against data in Kafka.
  • Clickstream Analysis Demo: Shows how to build an application that performs real-time user analytics.

These are just quick start materials, but are your ETL projects ever as simple as USERID to USERID? Or have such semantically transparent fields? Or what I take to be semantically transparent fields (they may not be).

As I asked in Where Do We Write Down Subject Identifications? earlier today: where do I record what I know about what appears in those fields, including on what basis to merge them with other data?

If you see where KSQL is offering that ability, please ping me because I’m missing it entirely. Thanks!

Where Do We Write Down Subject Identifications?

Wednesday, December 27th, 2017

Modern Data Integration Paradigms by Matthew D. Sarrel, The Bloor Group.


Businesses of all sizes and industries are rapidly transforming to make smarter, data-driven decisions. To accomplish this transformation to digital business, organizations are capturing, storing, and analyzing massive amounts of structured, semi-structured, and unstructured data from a large variety of sources. The rapid explosion in data types and data volume has left many IT and data science/business analyst leaders reeling.

Digital transformation requires a radical shift in how a business marries technology and processes. This isn’t merely improving existing processes, but rather redesigning them from the ground up and tightly integrating technology. The end result can be a powerful combination of greater efficiency, insight and scale that may even lead to disrupting existing markets. The shift towards reliance on data-driven decisions requires coupling digital information with powerful analytics and business intelligence tools in order to yield well-informed reasoning and business decisions. The greatest value of this data can be realized when it is analyzed rapidly to provide timely business insights. Any process can only be as timely as the underlying technology allows it to be.

Even data produced on a daily basis can exceed the capacity and capabilities of many pre-existing database management systems. This data can be structured or unstructured, static or streaming, and can undergo rapid, often unanticipated, change. It may require real-time or near-real-time transformation to be read into business intelligence (BI) systems. For these reasons, data integration platforms must be flexible and extensible to accommodate business’s types and usage patterns of the data.

There’s the usual homage to the benefits of data integration:

IT leaders should therefore try to integrate data across systems in a way that exposes them using standard and commonly implemented technologies such as SQL and REST. Integrating data, exposing it to applications, analytics and reporting improves productivity, simplifies maintenance, and decreases the amount of time and effort required to make data-driven decisions.

The paper covers, lightly, Operational Data Store (ODS) / Enterprise Data Hub (EDH), Enterprise Data Warehouse (EDW), Logical Data Warehouse (LDW), and Data Lake as data integration options.

Having found existing systems deficient in one or more ways, the report goes on to recommend replacement with Voracity.

To be fair, as described, all four systems plus Voracity are all deficient in the same way. The hard part of data integration, the rub that lies at the heart of the task, is passed over as ETL.

Efficient and correct ETL performance requires knowledge of what column headers identify. From the Enron spreadsheets, for instance, can you specify the transformation of the data in columns “A, B, C, D, E, F…” of andrea_ring_15_IFERCnov.xlsx, or “A, B, C, D, E,…” of andy_zipper__129__Success-TradeLog.xlsx?

With enough effort, no doubt you could go through spreadsheets of interest and create a mapping sufficient to transform data of interest, but where are you going to write down the facts you established for each column that underlie your transformation?

In topic maps, we made the mistake of mystifying the facts for each column by claiming to talk about subject identity, which has heavy ontological overtones.

What we should have said was that we wanted to ask: where do we write down subject identifications?


  1. What do you want to talk about?
  2. Data in column F in andrea_ring_15_IFERCnov.xlsx
  3. Do you want to talk about each entry separately?
  4. What subject is each entry? (date written month/day (no year))
  5. What calendar system was used for the date?
  6. Who created that date entry? (If want to talk about them as well, create a separate topic and an association to the spreadsheet.)
  7. The date is the date of … ?
  8. Conversion rules for dates in column F, such as supplying year.
  9. Merging rules for #2? (date comparison)
  10. Do you want relationship between #2 and the other data in each row? (more associations)

With simple questions, we have documented column F of a particular spreadsheet for any present or future ETL operation. No magic, no logical conundrums, no special query language, just asking what an author or ETL specialist knew but didn’t write down.
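Nothing about those answers requires special machinery; a plain machine-readable record suffices. A hypothetical Python sketch (the field names and rules are my inventions, not any topic map standard):

```python
# Hypothetical record of a subject identification for one spreadsheet column.
# Field names and rules are illustrative inventions, not a standard vocabulary.
column_f = {
    "subject_locator": "andrea_ring_15_IFERCnov.xlsx#F",  # points at the column itself
    "subject": "trade date",                  # what each entry talks about
    "format": "month/day, year omitted",      # as written in the cells
    "calendar": "Gregorian",
    "recorded_by": "unknown",                 # a separate topic, if we care who
    "conversion": "supply year from file context (hypothetical rule)",
    "merge_rule": "equal after normalizing to ISO 8601",
}

def mergeable(a, b):
    """Two column records identify the same subject when their subject and
    merge rules agree -- a deliberately simple-minded comparison."""
    return a["subject"] == b["subject"] and a["merge_rule"] == b["merge_rule"]
```

Search such records across projects and the same subject identified differently, or with extra properties, becomes discoverable rather than locked in an ETL specialist’s head.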

There are subtleties, such as distinguishing between subject identifiers (which identify a subject, like a wiki page) and subject locators (which point to the subject we want to talk about, like a particular spreadsheet), but identifying what you want to talk about (subject identifications and where to write them down) is more familiar than our prior obscurities.

Once those identifications are written down, you can search those identifications to discover the same subjects identified differently or with properties in one identification and not another. Think of it as capturing the human knowledge that resides in the brains of your staff and ETL experts.

The ETL assumed by Bloor Group should be written: ETLD – Extract, Transform, Load, Dump (knowledge). That seems remarkably inefficient and costly to me. You?

Tutorial on Deep Generative Models (slides and video)

Wednesday, December 27th, 2017

Slides for: Tutorial on Deep Generative Models by Shakir Mohamed and Danilo Rezende.


This tutorial will be a review of recent advances in deep generative models. Generative models have a long history at UAI and recent methods have combined the generality of probabilistic reasoning with the scalability of deep learning to develop learning algorithms that have been applied to a wide variety of problems giving state-of-the-art results in image generation, text-to-speech synthesis, and image captioning, amongst many others. Advances in deep generative models are at the forefront of deep learning research because of the promise they offer for allowing data-efficient learning, and for model-based reinforcement learning. At the end of this tutorial, audience members will have a full understanding of the latest advances in generative modelling covering three of the active types of models: Markov models, latent variable models and implicit models, and how these models can be scaled to high dimensional data. The tutorial will expose many questions that remain in this area, and for which there remains a great deal of opportunity from members of the UAI community.

Deep sledding on the latest developments in deep generative models (August 2017 presentation) that ends with a bibliography starting on slide 84 of 96.

Depending on how much time has passed since the tutorial, try searching the topics as they are covered, keep a bibliography of your finds and compare it to that of the authors.

No Peer Review at FiveThirtyEight

Wednesday, December 27th, 2017

Politics Moves Fast. Peer Review Moves Slow. What’s A Political Scientist To Do? by Maggie Koerth-Baker

From the post:

Politics has a funny way of turning arcane academic debates into something much messier. We’re living in a time when so much in the news cycle feels absurdly urgent and partisan forces are likely to pounce on any piece of empirical data they can find, either to champion it or tear it apart, depending on whether they like the result. That has major implications for many of the ways knowledge enters the public sphere — including how academics publicize their research.

That process has long been dominated by peer review, which is when academic journals put their submissions in front of a panel of researchers to vet the work before publication. But the flaws and limitations of peer review have become more apparent over the past decade or so, and researchers are increasingly publishing their work before other scientists have had a chance to critique it. That’s a shift that matters a lot to scientists, and the public stakes of the debate go way up when the research subject is the 2016 election. There’s a risk, scientists told me, that preliminary research results could end up shaping the very things that research is trying to understand.

The legend of peer review catching and correcting flaws has a long history. A legend much tarnished by the Top 10 Retractions of 2017 and similar reports. Retractions are self admissions of the failure of peer review. By the hundreds.

Withdrawal of papers isn’t the only debunking of peer review. The reports, papers, etc., on the failure of peer review include: “Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals,” Anaesthesia, Carlisle 2017, DOI: 10.1111/anae.13962; “The peer review drugs don’t work” by Richard Smith; “One in 25 papers contains inappropriately duplicated images, screen finds” by Cat Ferguson.

Koerth-Baker’s quoting of Justin Esarey to support peer review is an example of no or failed peer review at FiveThirtyEight.

But, on aggregate, 100 studies that have been peer-reviewed are going to produce higher-quality results than 100 that haven’t been, said Justin Esarey, a political science professor at Rice University who has studied the effects of peer review on social science research. That’s simply because of the standards that are supposed to go along with peer review – clearly reporting a study’s methodology, for instance – and because extra sets of eyes might spot errors the author of a paper overlooked.

Koerth-Baker acknowledges the failures of peer review, but since the article is premised upon peer review insulating the public from “bad science,” she brings in Justin Esarey, “…who has studied the effects of peer review on social science research.” One assumes his “studies” are mentioned to imbue his statements with an aura of authority.

Debunking Esarey’s authority to comment on the “…effects of peer review on social science research” doesn’t require much effort. If you scan his list of publications you will find Does Peer Review Identify the Best Papers?, which bears the sub-title, A Simulation Study of Editors, Reviewers, and the Social Science Publication Process.

Esarey’s comments on the effectiveness of peer review are not based on fact but on simulations of peer review systems. Useful work no doubt but hardly the confessing witness needed to exonerate peer review in view of its long history of failure.

To save you chasing the Esarey link, the abstract reads:

How does the structure of the peer review process, which can vary from journal to journal, influence the quality of papers published in that journal? In this paper, I study multiple systems of peer review using computational simulation. I find that, under any system I study, a majority of accepted papers will be evaluated by the average reader as not meeting the standards of the journal. Moreover, all systems allow random chance to play a strong role in the acceptance decision. Heterogeneous reviewer and reader standards for scientific quality drive both results. A peer review system with an active editor (who uses desk rejection before review and does not rely strictly on reviewer votes to make decisions) can mitigate some of these effects.

If there were peer reviewers, editors, etc., at FiveThirtyEight, shouldn’t at least one of them have looked beyond the title Does Peer Review Identify the Best Papers? and asked Koerth-Baker what evidence Esarey has for his support of peer review? Or is agreement with Koerth-Baker sufficient?

Peer review persists for a number of unsavory reasons: prestige, professional advancement, enforcement of discipline ideology, and the pretension of higher-quality publications. Let’s not add a false claim of serving the public.

Game of Thrones DVDs for Christmas?

Wednesday, December 27th, 2017

Mining Game of Thrones Scripts with R by Gokhan Ciflikli

If you are serious about defeating all comers to Game of Thrones trivia, then you need to know the scripts cold. (sorry)

Ciflikli introduces you to the quanteda and analysis of the Game of Thrones scripts in a single post saying:

I meant to showcase the quanteda package in my previous post on the Weinstein Effect but had to switch to tidytext at the last minute. Today I will make good on that promise. quanteda is developed by Ken Benoit and maintained by Kohei Watanabe – go LSE! On that note, the first 2018 LondonR meeting will be taking place at the LSE on January 16, so do drop by if you happen to be around. quanteda v1.0 will be unveiled there as well.

Given that I have already used the data I had in mind, I have been trying to identify another interesting (and hopefully less depressing) dataset for this particular calling. Then it snowed in London, and the dire consequences of this supernatural phenomenon were covered extensively by the r/CasualUK/. One thing led to another, and before you know it I was analysing Game of Thrones scripts:

2018, with its mid-term congressional elections, will be a big year for leaked emails and documents, in addition to the usual follies of government.

Text mining/analysis skills you gain with the Game of Thrones scripts will be in high demand by partisans, investigators, prosecutors, just about anyone you can name.

From the quanteda documentation site:

quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what the documents and features are. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
… (emphasis in original)
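quanteda itself is R, but the corpus-to-document-feature-matrix workflow described above is simple enough to sketch in plain Python (a stand-in illustration, not quanteda’s API; the sample texts and stopword list are invented):

```python
from collections import Counter
import re

# Invented toy corpus; any dict of document name -> text would do.
corpus = {
    "tyrion": "a mind needs books as a sword needs a whetstone",
    "jon": "winter is coming and the sword is ready",
}

STOPWORDS = {"a", "as", "and", "the", "is"}

def tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

# Documents-by-features counts: the structure quanteda calls a dfm.
dfm = {doc: Counter(tokens(text)) for doc, text in corpus.items()}

print(dfm["tyrion"]["needs"])  # 2
```

Everything quanteda adds (stemming, dictionaries, tf-idf weighting, regrouping documents) is elaboration on this one structure.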

Once you follow the analysis of the Game of Thrones scripts, what other texts or features of quanteda will catch your eye?


From the Valley of Disinformation Rode the 770 – Opportunity Knocks

Wednesday, December 27th, 2017

More than 700 employees have left the EPA since Scott Pruitt took over by Natasha Geiling.

From the post:

Since Environmental Protection Agency Administrator Scott Pruitt took over the top job at the agency in March, more than 700 employees have either retired, taken voluntary buyouts, or quit, signaling the second-highest exodus of employees from the agency in nearly a decade.

According to agency documents and federal employment statistics, 770 EPA employees departed the agency between April and December, leaving employment levels close to Reagan-era levels of staffing. According to the EPA’s contingency shutdown plan for December, the agency currently has 14,449 employees on board — a marked change from the April contingency plan, which showed a staff of 15,219.

These departures offer journalists a rare opportunity to bleed the government like a stuck pig. From untimely remission of login credentials to acceptance of spear phishing emails, opportunities abound.

Not for “reach it to me” journalists who use sources as shields from potential criminal liability, while their colleagues are imprisoned for the simple act of publication or murdered (42 as of today in 2017).

Governments have not, are not and will not act in the public interest. Laws that criminalize acquisition of data or documents are a continuation of their failure to act in the public interest.

Journalists who serve the public interest, by exposing the government’s failure to do so, should use any means at their disposal to obtain data and documents that evidence government failure and misconduct.

Are you a journalist serving the public interest or a “reach it to me” journalist, serving the public interest when there’s no threat to you?

xsd2json – XML Schema to JSON Schema Transform

Tuesday, December 26th, 2017

xsd2json by Loren Cahlander.

From the webpage:

XML Schema to JSON Schema Transform – Development and Test Environment

The options that are supported are:

‘keepNamespaces’ – set to true if keeping prefixes in the property names is required, otherwise prefixes are eliminated.

‘schemaId’ – the name of the schema

#xs:short
{ "type": "integer", "xsdType": "xs:short", "minimum": -32768, "maximum": 32767, "exclusiveMinimum": false, "exclusiveMaximum": false }
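A hypothetical Python sketch of that kind of mapping, following the xs:short example (the bounds come from the XSD built-in datatype definitions; the output shape mirrors the example above, not xsd2json’s actual code):

```python
# Hypothetical mapping from a few XSD built-in integer types to JSON Schema
# fragments, in the shape of the xs:short example. Bounds follow the XSD
# datatype definitions; this is an illustration, not xsd2json's code.
XSD_INT_BOUNDS = {
    "xs:byte": (-128, 127),
    "xs:short": (-32768, 32767),
    "xs:int": (-2147483648, 2147483647),
}

def xsd_to_json_schema(xsd_type):
    lo, hi = XSD_INT_BOUNDS[xsd_type]
    return {
        "type": "integer",
        "xsdType": xsd_type,
        "minimum": lo,
        "maximum": hi,
        "exclusiveMinimum": False,
        "exclusiveMaximum": False,
    }

print(xsd_to_json_schema("xs:short")["maximum"])  # 32767
```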

To be honest, I can’t imagine straying from Relax-NG, much less converting an XSD schema into a JSON schema.

But it’s not possible to predict all needs and futures (hint to AI alarmists). It will be easier to find xsd2json here than with adware-burdened “modern” search engines, should the need arise.

Geocomputation with R – Open Book in Progress – Contribute

Tuesday, December 26th, 2017

Geocomputation with R by Robin Lovelace, Jakub Nowosad, Jannes Muenchow.

Welcome to the online home of Geocomputation with R, a forthcoming book with CRC Press.


Inspired by bookdown and other open source projects we are developing this book in the open. Why? To encourage contributions, ensure reproducibility and provide access to the material as it evolves.

The book’s development can be divided into four main phases:

  1. Foundations
  2. Basic applications
  3. Geocomputation methods
  4. Advanced applications

Currently the focus is on Part 2, which we aim to be complete by December. New chapters will be added to this website as the project progresses, hosted at and kept up-to-date thanks to Travis….

Speaking of R and geocomputation, I’ve been trying to remember to post about Geocomputation with R since I encountered it a week or more ago. Not what I expect from CRC Press. That got my attention right away!

Part II, Basic Applications has two chapters, 7 Location analysis and 8 Transport applications.

Layering display of data from different sources should be included under Basic Applications. For example, relying on, but not displaying, topographic data to calculate line of sight between positions. Perhaps the base display is a high-resolution image overlaid with GPS coordinates at intervals, with computed lines of sight drawn on the structures themselves.
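A toy sketch of such a line-of-sight calculation over a one-dimensional elevation profile (all numbers invented; straight rays, no earth curvature or refraction):

```python
def line_of_sight(elevations, observer_height=2.0):
    """Return True if the last point is visible from the first, given
    elevations sampled at equal intervals along the line between them.
    A toy model: straight rays, no earth curvature or refraction."""
    n = len(elevations)
    eye = elevations[0] + observer_height
    target = elevations[-1]
    for i in range(1, n - 1):
        # Height of the sight line at sample i, interpolated eye -> target.
        ray = eye + (target - eye) * i / (n - 1)
        if elevations[i] > ray:
            return False  # terrain blocks the ray
    return True

print(line_of_sight([10, 11, 10.5, 10]))  # True
print(line_of_sight([10, 40, 11, 10]))    # False: a ridge in the way
```

A real implementation would sample the profile from a raster DEM; the visibility test itself stays this simple.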

Other “basic applications” you would suggest?

Looking forward to progress on this volume!

All targets have spatial-temporal locations.

Tuesday, December 26th, 2017


From the about page: a website and blog for those interested in using R to analyse spatial or spatio-temporal data.

Posts in the last six months to whet your appetite for this blog:

The budget of a government for spatial-temporal software is no indicator of skill with spatial and spatial-temporal data.

How are yours?

Deep Learning for NLP, advancements and trends in 2017

Sunday, December 24th, 2017

Deep Learning for NLP, advancements and trends in 2017 by Javier Couto.

If you didn’t get enough books as presents, Couto solves your reading shortage rather nicely:

Over the past few years, Deep Learning (DL) architectures and algorithms have made impressive advances in fields such as image recognition and speech processing.

Their application to Natural Language Processing (NLP) was less impressive at first, but has now proven to make significant contributions, yielding state-of-the-art results for some common NLP tasks. Named entity recognition (NER), part of speech (POS) tagging or sentiment analysis are some of the problems where neural network models have outperformed traditional approaches. The progress in machine translation is perhaps the most remarkable among all.

In this article I will go through some advancements for NLP in 2017 that rely on DL techniques. I do not pretend to be exhaustive: it would simply be impossible given the vast amount of scientific papers, frameworks and tools available. I just want to share with you some of the works that I liked the most this year. I think 2017 has been a great year for our field. The use of DL in NLP keeps widening, yielding amazing results in some cases, and all signs point to the fact that this trend will not stop.

After skimming this post, I suggest you make a fresh pot of coffee before starting to read and chase the references. It will take several days/pots to finish so it’s best to begin now.

Adversarial Learning Market Opportunity

Sunday, December 24th, 2017

The Pentagon’s New Artificial Intelligence Is Already Hunting Terrorists by Marcus Weisgerber.

From the post:

Earlier this month at an undisclosed location in the Middle East, computers using special algorithms helped intelligence analysts identify objects in a video feed from a small ScanEagle drone over the battlefield.

A few days into the trials, the computer identified objects – people, cars, types of building – correctly about 60 percent of the time. Just over a week on the job – and a handful of on-the-fly software updates later – the machine’s accuracy improved to around 80 percent. Next month, when its creators send the technology back to war with more software and hardware updates, they believe it will become even more accurate.

It’s an early win for a small team of just 12 people who started working on the project in April. Over the next year, they plan to expand the project to help automate the analysis of video feeds coming from large drones – and that’s just the beginning.

“What we’re setting the stage for is a future of human-machine teaming,” said Air Force Lt. Gen. John N.T. “Jack” Shanahan, director for defense intelligence for warfighter support, the Pentagon general who is overseeing the effort. Shanahan believes the concept will revolutionize the way the military fights.

So you will recognize Air Force Lt. Gen. John N.T. “Jack” Shanahan:

From the Nvidia conference:

Don’t change the culture. Unleash the culture.

That was the message one young officer gave Lt. General John “Jack” Shanahan — the Pentagon’s director for defense for warfighter support — who is hustling to put artificial intelligence and machine learning to work for the U.S. Defense Department.

Highlighting the growing role AI is playing in security, intelligence and defense, Shanahan spoke Wednesday during a keynote address about his team’s use of GPU-driven deep learning at our GPU Technology Conference in Washington.

Shanahan leads Project Maven, an effort launched in April to put machine learning and AI to work, starting with efforts to turn the countless hours of aerial video surveillance collected by the U.S. military into actionable intelligence.

There are at least two market opportunities for adversarial learning. The most obvious one is testing a competitor’s algorithm so it performs less well than yours on “… people, cars, types of building….”
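That first market is essentially the adversarial-examples problem. A toy Python sketch of a worst-case perturbation against a hand-built linear scorer, in the spirit of the fast gradient sign method (weights and features are invented, with no relation to any deployed system):

```python
# Toy adversarial perturbation against a linear score, in the spirit of the
# fast gradient sign method. The "detector" weights and input are invented.
def score(w, x, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def perturb(w, x, eps):
    # Move each feature against the sign of its weight: the worst-case
    # change within an L-infinity budget of eps for lowering the score.
    return [xi - eps * (1 if wi > 0 else -1 if wi < 0 else 0)
            for wi, xi in zip(w, x)]

w, b = [0.8, -0.5, 0.3], -0.5   # invented "vehicle detector" weights
x = [1.0, 0.2, 0.7]             # invented input features

clean = score(w, x, b)
adv = score(w, perturb(w, x, 0.3), b)
print(clean > 0 > adv)          # detection flips after the perturbation
```

The score always drops by exactly eps times the L1 norm of the weights, which is why small perturbations can flip confident detections.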

The less obvious market requires US sales of AI-enabled weapon systems to its client states. Client states have an interest in verifying the quality of AI-enabled weapon systems, not to mention non-client states who will be interested in defeating such systems.

For any of those markets, weaponizing adversarial learning and developing a reputation for the same can’t start too soon. Is your anti-AI research department hiring?

Ichano AtHome IP Cameras – Free Vulnerabilities from Amazon

Sunday, December 24th, 2017

SSD Advisory – Ichano AtHome IP Cameras Multiple Vulnerabilities

Catalin Cimpanu @campuscodi pointed to these free vulnerabilities:

AtHome Camera is “a remote video surveillance app which turns your personal computer, smart TV/set-top box, smart phone, and tablet into a professional video monitoring system in a minute.”

The vulnerabilities found are:

  • Hard-coded username and password – telnet
  • Hard-coded username and password – Web server
  • Unauthenticated Remote Code Execution

Did you know the AtHome Camera – Remote video surveillance, Home security, Monitoring, IP Camera by iChano is a free download at Amazon?

That’s right! You can get all three of these vulnerabilities for free! Ranked “#270 in Apps & Games > Utilities,” as of 24 December 2017.

Context Sensitive English Glosses and Interlinears – Greek New Testament

Sunday, December 24th, 2017

Context Sensitive English Glosses and Interlinears by Jonathan Robie.

From the post:

I am working on making the greeksyntax package for Jupyter more user-friendly in various ways, and one of the obvious ways to do that is to provide English glosses.

Contextual glosses in English are now available in the Nestle 1904 Lowfat trees. These glosses have been available in the Nestle1904 repository, where they were extracted from the Berean Interlinear Bible with their generous permission. I merged them into the Nestle 1904 Lowfat treebank using this query. And now they are available whenever you use this treebank.

Another improvement in the resources available to non-professionals who study the Greek New Testament.

Nestle 1904 isn’t the latest work but then the Greek New Testament isn’t the hotbed of revision it once was. 😉

If you are curious why the latest editions of the Greek New Testament aren’t freely available to the public, you will have to ask the scholars who publish them.

My explanation for hoarding of the biblical text isn’t a generous one.

Sleuth Kit – Checking Your Footprints (if any)

Sunday, December 24th, 2017

Open Source File System Digital Forensics: The Sleuth Kit

From the webpage:

The Sleuth Kit is an open source forensic toolkit for analyzing Microsoft and UNIX file systems and disks. The Sleuth Kit enables investigators to identify and recover evidence from images acquired during incident response or from live systems. The Sleuth Kit is open source, which allows investigators to verify the actions of the tool or customize it to specific needs.

The Sleuth Kit uses code from the file system analysis tools of The Coroner’s Toolkit (TCT) by Wietse Venema and Dan Farmer. The TCT code was modified for platform independence. In addition, support was added for the NTFS and FAT file systems. Previously, The Sleuth Kit was called The @stake Sleuth Kit (TASK). The Sleuth Kit is now independent of any commercial or academic organizations.

It is recommended that these command line tools can be used with the Autopsy Forensic Browser. Autopsy is a graphical interface to the tools of The Sleuth Kit and automates many of the procedures and provides features such as image searching and MD5 image integrity checks.

As with any investigation tool, any results found with The Sleuth Kit should be recreated with a second tool to verify the data.

The Sleuth Kit allows one to analyze a disk or file system image created by ‘dd’, or a similar application that creates a raw image. These tools are low-level and each performs a single task. When used together, they can perform a full analysis.
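The MD5 image integrity check mentioned above is easy to reproduce independently; a minimal Python sketch (the image file name is hypothetical):

```python
import hashlib

def image_digest(path, chunk_size=1 << 20):
    """MD5 of a disk image, read in chunks so multi-gigabyte images
    produced by dd do not have to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()

# Record the digest at acquisition time; recompute before and after
# analysis to show the image was not altered.
# print(image_digest("disk.dd"))  # "disk.dd" is a hypothetical image name
```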

Question: Who should find your footprints first? You or someone investigating an incident?

Test your penetration techniques for footprints before someone else does. Yes?

BTW, pick up a copy of the Autopsy Forensic Browser.

Unix Magnificent Seven + Bash (MorphGNT)

Sunday, December 24th, 2017

Some Unix Command Line Exercises Using MorphGNT by James Tauber.

From the post:

I thought I’d help a friend learn some basic Unix command line (although pretty comprehensive for this type of work) with some practical graded exercises using MorphGNT. It worked out well so I thought I’d share in case they are useful to others.

The point here is not to actually teach how to use bash or commands like grep, awk, cut, sort, uniq, head or wc but rather to motivate their use in a gradual fashion with real use cases and to structure what to actually look up when learning how to use them.

This little set of commands has served me well for over twenty years working with MorphGNT in its various iterations (although I obviously switch to Python for anything more complex).
… (emphasis in original)

Great demonstration of what the Unix Magnificent Seven + bash can accomplish.
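And as Tauber says, anything past a one-liner is easier in Python. A rough Python equivalent of the classic `cut | sort | uniq -c | sort -rn` frequency exercise, assuming a simplified tab-separated layout with the lemma in the last column (these sample rows are invented for illustration, not actual MorphGNT data):

```python
from collections import Counter

# Invented rows mimicking a tab-separated morphological layout;
# real MorphGNT files have more columns and real text.
rows = [
    "010101\tN-\tλόγος",
    "010102\tV-\tλέγω",
    "010103\tN-\tλόγος",
]

# The pipeline `cut -f3 | sort | uniq -c | sort -rn`, in Python:
lemmas = (line.split("\t")[-1] for line in rows)
for lemma, count in Counter(lemmas).most_common():
    print(count, lemma)
```

Same graded-exercise idea: start with the shell pipeline, then reach for `Counter` once you need grouping or joins the Magnificent Seven make painful.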

Oh, MorphGNT, Linguistic Databases and Python Tools for the Greek New Testament.

Next victim of your Unix text hacking skills?

A/B Tests for Disinformation/Fake News?

Sunday, December 24th, 2017

Digital Shadows says it:

Digital Shadows monitors, manages, and remediates digital risk across the widest range of sources on the visible, deep, and dark web to protect your organization.

It recently published The Business of Disinformation: A Taxonomy – Fake news is more than a political battlecry.

It’s not long, fourteen (14) pages, and it has the usual claims about disinformation and fake news you know from other sources.
However, for all its breathless prose and promotion of its solution, there is no mention of any A/B tests to show that disinformation or fake news is effective in general or against you in particular.

The value proposition offered by Digital Shadows is everyone says disinformation and fake news are important, therefore spend money with us to combat it.

Alien abduction would be important but I won’t be buying alien abduction insurance or protection services any time soon.

Proof of the effectiveness of disinformation and fake news is on a par with proof of alien abduction.

Anything is possible, but spending money or creating policies requires proof.

Where’s the proof for the effectiveness of disinformation or fake news? No proof, no spending. Yes?

SMB – 1 billion vulnerable machines

Thursday, December 21st, 2017

An Introduction to SMB for Network Security Analysts by Nate “Doomsday” Marx.

Of all the common protocols a new analyst encounters, perhaps none is quite as impenetrable as Server Message Block (SMB). Its enormous size, sparse documentation, and wide variety of uses can make it one of the most intimidating protocols for junior analysts to learn. But SMB is vitally important: lateral movement in Windows Active Directory environments can be the difference between a minor and a catastrophic breach, and almost all publicly available techniques for this movement involve SMB in some way. While there are numerous guides to certain aspects of SMB available, I found a dearth of material that was accessible, thorough, and targeted towards network analysis. The goal of this guide is to explain this confusing protocol in a way that helps new analysts immediately start threat hunting with it in their networks, ignoring the irrelevant minutiae that seem to form the core of most SMB primers and focusing instead on the kinds of threats an analyst is most likely to see. This guide necessarily sacrifices completeness for accessibility: further in-depth reading is provided in footnotes. There are numerous simplifications throughout to make the basic operation of the protocol more clear; the fact that they are simplifications will not always be highlighted. Lastly, since this guide is an attempt to explain the SMB protocol from a network perspective, the discussion of host based information (windows logs, for example) has been omitted.
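One concrete detail worth internalizing early: over direct TCP (port 445), SMB messages carry a 4-byte transport header followed by a protocol magic — `\xffSMB` for SMB1, `\xfeSMB` for SMB2/3, and `\xfdSMB` for the encrypted SMB3 transform header. A minimal version-detection sketch over raw payload bytes (the sample payloads below are fabricated, not real captures):

```python
def smb_version(tcp_payload: bytes) -> str:
    """Classify an SMB message carried over direct TCP (port 445).

    The first 4 bytes are the NetBIOS/direct-TCP transport header;
    the protocol magic follows immediately after.
    """
    magic = tcp_payload[4:8]
    if magic == b"\xffSMB":
        return "SMB1"
    if magic == b"\xfeSMB":
        return "SMB2/3"
    if magic == b"\xfdSMB":
        return "SMB3 (encrypted transform)"
    return "not SMB"

# Fabricated example payloads: transport header + magic + truncated body.
print(smb_version(b"\x00\x00\x00\x55" + b"\xffSMB" + b"\x72"))
print(smb_version(b"\x00\x00\x00\x40" + b"\xfeSMB" + b"\x40\x00"))
```

Seeing `\xffSMB` on a modern network is itself a hunting lead: SMB1 should be rare to nonexistent in a well-maintained Windows environment.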

It never occurred to me that NTLM, introduced with Windows NT in 1993, is still supported in the latest version of Windows.

That means deep knowledge of SMB puts the number of systems vulnerable to you at close to 1 billion.

How’s that for a line in your CV?

Keeper Security – Beyond Boo-Hooing Over Security Bullies

Thursday, December 21st, 2017

Security firm Keeper sues news reporter over vulnerability story by Zack Whittaker.

From the post:

Keeper, a password manager software maker, has filed a lawsuit against a news reporter and its publication after a story was posted reporting a vulnerability disclosure.

Dan Goodin, security editor at Ars Technica, was named defendant in a suit filed Tuesday by Chicago-based Keeper Security, which accused Goodin of “false and misleading statements” about the company’s password manager.

Goodin’s story, posted December 15, cited Google security researcher Tavis Ormandy, who said in a vulnerability disclosure report he posted a day earlier that a security flaw in Keeper allowed “any website to steal any password” through the password manager’s browser extension.

Goodin was one of the first to cover news of the vulnerability disclosure. He wrote that the password manager was bundled in some versions of Windows 10. When Ormandy tested the bundled password manager, he found a password stealing bug that was nearly identical to one he previously discovered in 2016.

Ormandy also posted a proof-of-concept exploit for the new vulnerability.

I’ll spare you the boo-hooing over Keeper Security‘s attempt to bully Dan Goodin and Ars Technica.

Social media criticism is like the vice-presidency, it’s not worth a warm bucket of piss.

What the hand-wringers over the bullying of Dan Goodin and Ars Technica fail to mention is your ability to simply stop using Keeper Security. Not a word.

In The Best Password Managers of 2018, I see ten (10) top password managers, three of which are rated as equal to or better than Keeper Security.

Sadly I don’t use Keeper Security so I can’t send tweet #1: I refuse to use/renew Keeper Security until it abandons persecution of @dangoodin001 and @arstechnica, plus pays their legal fees.

I’m left with tweet #2: I refuse to consider using Keeper Security until it abandons persecution of @dangoodin001 and @arstechnica, plus pays their legal fees.

Choose tweet 1 or 2, ask your friends to take action, and to retweet.

Emacs X Window Manager

Thursday, December 21st, 2017

Emacs X Window Manager by Chris Feng.

From the webpage:

EXWM (Emacs X Window Manager) is a full-featured tiling X window manager for Emacs built on top of XELB. It features:

  • Fully keyboard-driven operations
  • Hybrid layout modes (tiling & stacking)
  • Dynamic workspace support
  • ICCCM/EWMH compliance
  • (Optional) RandR (multi-monitor) support
  • (Optional) Built-in compositing manager
  • (Optional) Built-in system tray

Please check out the screenshots to get an overview of what EXWM is capable of, and the user guide for a detailed explanation of its usage.

Note: If you install EXWM from source, it’s recommended to install XELB also from source (otherwise install both from GNU ELPA).

OK, one screenshot:

BTW, EXWM supports multiple monitors as well.


Learn to Write Command Line Utilities in R

Thursday, December 21st, 2017

Learn to Write Command Line Utilities in R by Mark Sellors.

From the post:

Do you know some R? Have you ever wanted to write your own command line utilities, but didn’t know where to start? Do you like Harry Potter?

If the answer to these questions is “Yes!”, then you’ve come to the right place. If the answer is “No”, but you have some free time, stick around anyway, it might be fun!

Sellors invokes the tradition of *nix command line tools saying: “The thing that most [command line] tools have in common is that they do a small number of things really well.”

The question to you is: What small things do you want to do really well?
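The do-one-thing-well pattern Sellors teaches in R carries over to any language. A hedged sketch of the same idea in Python (the tool name, flags, and sample invocation are mine, invented for illustration):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """One small tool, one job: count lines (a toy 'wc -l')."""
    parser = argparse.ArgumentParser(
        prog="lc", description="Count lines in the given files.")
    parser.add_argument("files", nargs="+", help="files to count")
    parser.add_argument("-t", "--total", action="store_true",
                        help="print only the grand total")
    return parser

def count_lines(text: str) -> int:
    """Line count that also counts a final line lacking a trailing newline."""
    if not text:
        return 0
    return text.count("\n") + (0 if text.endswith("\n") else 1)

# Parse a sample invocation rather than sys.argv, so the sketch is testable.
args = build_parser().parse_args(["-t", "notes.txt"])
print(args.total, args.files)
```

The design choice is the same one Sellors makes: keep the interface tiny and predictable so the tool composes well in pipelines.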

Weird machines, exploitability, and provable unexploitability

Thursday, December 21st, 2017

Weird machines, exploitability, and provable unexploitability by Thomas Dullien (IEEE pre-print, to appear IEEE Transactions on Emerging Topics in Computing)


The concept of exploit is central to computer security, particularly in the context of memory corruptions. Yet, in spite of the centrality of the concept and voluminous descriptions of various exploitation techniques or countermeasures, a good theoretical framework for describing and reasoning about exploitation has not yet been put forward.

A body of concepts and folk theorems exists in the community of exploitation practitioners; unfortunately, these concepts are rarely written down or made sufficiently precise for people outside of this community to benefit from them.

This paper clarifies a number of these concepts, provides a clear definition of exploit, a clear definition of the concept of a weird machine, and how programming of a weird machine leads to exploitation. The paper also shows, somewhat counterintuitively, that it is feasible to design some software in a way that even powerful attackers – with the ability to corrupt memory once – cannot gain an advantage.

The approach in this paper is focused on memory corruptions. While it can be applied to many security vulnerabilities introduced by other programming mistakes, it does not address side channel attacks, protocol weaknesses, or security problems that are present by design.

A common vocabulary to bridge the gap between ‘Exploit practitioners’ (EPs) and academic researchers. Whether it will in fact bridge that gap remains to be seen. Even the attempt will prove to be useful.

Tracing the use/propagation of Dullien’s vocabulary across Google’s Project Zero reports and papers would provide a unique data set on the spread (or not) of a new vocabulary in computer science.

Not to mention being a way to map back into earlier literature with the newer vocabulary, via a topic map.

BTW, Dullien’s statement “it is feasible to design some software in a way that even powerful attackers … cannot gain an advantage,” is speculation and should not dampen your holiday spirits. (I root for the hare and not the hounds as a rule.)

Nine Kinds of Ancient Greek Treebanks

Thursday, December 21st, 2017

Nine Kinds of Ancient Greek Treebanks by Jonathan Robie.

When I blog or speak about Greek treebanks, I frequently refer to one or more of the treebanks that are currently available. Few people realize how many treebanks exist for ancient Greek, and even fewer have ever seriously looked at more than one. I do not know of a web page that lists all of the ones I know of, so I thought it would be helpful to list them in one blog post, providing basic information about each.

So here is a catalog of treebanks for ancient Greek.

Most readers of this blog know Jonathan Robie from his work on XQuery and XPath, two of the XML projects that have benefited from his leadership.

What readers may not know is that Jonathan originated both b-greek (Biblical Greek Forum, est. 1992) and b-hebrew (Biblical Hebrew Forum, est. 1997). Those are not typos, b-greek began in 1992 and b-hebrew in 1997. (I checked the archives before posting.)

Not content to be the origin and maintainer of two of the standard discussion forums for biblical languages, Jonathan has undertaken to produce high quality open data for serious Bible students and professional scholars.

Texts in multiple treebanks, such as the Greek NT, make a great use case for display and analysis of overlapping trees.

Violating TCP

Wednesday, December 20th, 2017

This is strictly a violation of the TCP specification by Marek Majkowski.

From the post:

I was asked to debug another weird issue on our network. Apparently every now and then a connection going through CloudFlare would time out with 522 HTTP error.

522 error on CloudFlare indicates a connection issue between our edge server and the origin server. Most often the blame is on the origin server side – the origin server is slow, offline or encountering high packet loss. Less often the problem is on our side.

In the case I was debugging it was neither. The internet connectivity between CloudFlare and origin was perfect. No packet loss, flat latency. So why did we see a 522 error?

The root cause of this issue was pretty complex. After a lot of debugging we identified an important symptom: sometimes, once in thousands of runs, our test program failed to establish a connection between two daemons on the same machine. To be precise, an NGINX instance was trying to establish a TCP connection to our internal acceleration service on localhost. This failed with a timeout error.

It’s unlikely that you will encounter this issue but Majkowski’s debugging of it is a great story.

It also illustrates how deep the foundations of an error, bug or vulnerability may lie.

Is it a vehicle? A helicopter? No, it’s a rifle! Messing with Machine Learning

Wednesday, December 20th, 2017

Partial Information Attacks on Real-world AI

From the post:

We’ve developed a query-efficient approach for finding adversarial examples for black-box machine learning classifiers. We can even produce adversarial examples in the partial information black-box setting, where the attacker only gets access to “scores” for a small number of likely classes, as is the case with commercial services such as Google Cloud Vision (GCV).

The post is a quick read (est. 2 minutes) with references but you really need to see:

Query-efficient Black-box Adversarial Examples by Andrew Ilyas, Logan Engstrom, Anish Athalye, Jessy Lin.


Current neural network-based image classifiers are susceptible to adversarial examples, even in the black-box setting, where the attacker is limited to query access without access to gradients. Previous methods — substitute networks and coordinate-based finite-difference methods — are either unreliable or query-inefficient, making these methods impractical for certain problems.

We introduce a new method for reliably generating adversarial examples under more restricted, practical black-box threat models. First, we apply natural evolution strategies to perform black-box attacks using two to three orders of magnitude fewer queries than previous methods. Second, we introduce a new algorithm to perform targeted adversarial attacks in the partial-information setting, where the attacker only has access to a limited number of target classes. Using these techniques, we successfully perform the first targeted adversarial attack against a commercially deployed machine learning system, the Google Cloud Vision API, in the partial information setting.

The paper contains this example:

How does it go? Seeing is believing!

Defeating image classifiers will be an exploding market for jewel merchants, bankers, diplomats, and others with reasons to avoid being captured by modern image classification systems.
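The heart of the paper’s first contribution, natural evolution strategies (NES), is a gradient estimate built purely from black-box value queries. A toy one-dimensional sketch (the score function, constants, and sample count are mine, not the paper’s):

```python
import random

def nes_gradient(f, x, sigma=0.1, n_samples=500, rng=None):
    """Estimate df/dx at x using antithetic NES samples.

    Only queries f's value -- no gradients -- mimicking the
    black-box access an attacker has to a classifier's scores.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)
        total += (f(x + sigma * u) - f(x - sigma * u)) * u
    return total / (2 * sigma * n_samples)

def score(x):
    # Black-box "score" peaking at x = 3; the true gradient at x = 0 is +6.
    return -(x - 3.0) ** 2

estimate = nes_gradient(score, 0.0)
print(estimate)  # the analytic answer is 6.0; the estimate should land nearby
```

In the real attack the scalar `x` is an image, `f` is the target class score returned by the API, and the same estimator drives projected gradient ascent using only a few score queries per step.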

Offensive Security Conference – February 12-17 2018 // Berlin

Wednesday, December 20th, 2017

Offensive Security Conference – February 12-17 2018 // Berlin

If you haven’t already registered/made travel arrangements, perhaps the speakers list will hurry you along.

While you wait for the conference, can you match the author(s) to the papers based on title alone? Several papers have multiple authors, but which ones?


What’s in Your Wallet? Photo Defeats Windows 10 Facial Recognition

Wednesday, December 20th, 2017

It took more than a wallet-sized photo, but until patched, the Windows 10 Hello facial recognition feature accepted a near IR printed (340×340 pixel) image to access a Windows device.

Catalin Cimpanu has the details at: Windows 10 Facial Recognition Feature Can Be Bypassed with a Photo.

The disturbing line in Cimpanu’s report reads:

The feature is not that widespread since not many devices with the necessary hardware, yet when present, it is often used since it’s quite useful at unlocking computers without having users type in long passwords.

When hardware support for Windows Hello spreads, you can imagine its default use in corporate and government offices.

The Microsoft patch may defeat a 2-D near IR image but for the future, I’d invest in a 3-D printer with the ability to print in the near IR.

I don’t think your Guy Fawkes mask will work on most Windows devices.

But it might make a useful “cover” for a less common mask. If security forces have to search every Guy Fawkes mask, some Guy Fawkes+ masks are bound to slip through. Statistically speaking.

Was that Stevie Nicks or Tacotron 2.0? ML Singing in 2018

Tuesday, December 19th, 2017

[S]amim @samim tweeted:

In 2018, machine learning based singing vocal synthesisers will go mainstream. It will transform the music industry beyond recognition.

With these two links:

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions by Jonathan Shen, et al.


This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.


Audio samples from “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”

Try the samples before dismissing the prediction of machine learning singing in 2018.

I have a different question:

What is in your test set for ML singing?

Among my top picks: Stevie Nicks, Janis Joplin, and of course, Grace Slick.

Practicing Vulnerability Hunting in Programming Languages for Music

Tuesday, December 19th, 2017

If you watched Natalie Silvanovich‘s presentation on mining the JavaScript standard for vulnerabilities, the tweet from Computer Science @CompSciFact pointing to Programming Languages Used for Music must have you drooling like one of Pavlov’s dogs.

I count one hundred and forty-seven (147) languages, of varying degrees of popularity, none of which has gotten the security review of ECMA-262. (Michael Aranda wades through terminology/naming issues for ECMAScript vs. JavaScript at: What’s the difference between JavaScript and ECMAScript?.)

Good hunting!