Archive for July, 2015

Robotic Article Tagging (OpenOffice Idea)

Friday, July 31st, 2015

The New York Times built a robot to help make article tagging easier by Justin Ellis.

From the post:

If you write online, you know that a final, tedious part of the process is adding tags to your story before sending it out to the wider world.

Tags and keywords in articles help readers dig deeper into related stories and topics, and give search audiences another way to discover stories. A Nieman Lab reader could go down a rabbit hole of tags, finding all our stories mentioning Snapchat, Nick Denton, or Mystery Science Theater 3000.

Those tags can also help newsrooms create new products and find inventive ways of collecting content. That’s one reason The New York Times Research and Development lab is experimenting with a new tool that automates the tagging process using machine learning — and does it in real time.

The Times R&D Editor tool analyzes text as it’s written and suggests tags along the way, in much the way that spell-check tools highlight misspelled words:

Great post but why not take the “…in much the way that spell-check tools highlight misspelled words” just a step further?

Apache OpenOffice already has spell-checking, so why not improve it to have automatic tagging?

You may or may not know that Open Document Format (ODF) 1.2 was just published as an ISO standard!

Which is the format used by Apache OpenOffice.

Open Document Format (ODF) 1.2 supports RDFa for inline metadata.

Now, imagine for a moment using standard office suite software (Apache OpenOffice) to choose a metadata dictionary and have your content automatically tagged as you type, or to insert a document and have tags automatically inserted into its text.

Does that sound like a killer application for your corner of the woods?

A universal dictionary of RDFa tags might be a real memory hog but how many different tags would you need day to day? That’s even an empirical question that could be answered by indexing your documents for the past six (6) months.
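To make the idea concrete, here is a minimal sketch of dictionary-driven tagging in Python. It is an illustration only, not OpenOffice code: the dictionary entries, the example.org identifiers, and the function names `find_tags` and `tag_usage` are all assumptions made up for this sketch. It finds dictionary terms in a text and counts how often each tag is used across a set of documents, which is the empirical question raised above.

```python
import re
from collections import Counter

# Hypothetical metadata dictionary: surface term -> identifier (made-up URIs).
TAG_DICTIONARY = {
    "Apache OpenOffice": "http://example.org/tag/apache-openoffice",
    "ODF": "http://example.org/tag/odf",
    "RDFa": "http://example.org/tag/rdfa",
}


def find_tags(text, dictionary=TAG_DICTIONARY):
    """Return (start, end, term, identifier) for each dictionary term found in text."""
    if not dictionary:
        return []
    # Longest terms first, so a longer term wins over a shorter one it contains.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [
        (m.start(), m.end(), m.group(1), dictionary[m.group(1)])
        for m in pattern.finditer(text)
    ]


def tag_usage(documents, dictionary=TAG_DICTIONARY):
    """Count how often each tag occurs across a collection of document texts."""
    counts = Counter()
    for text in documents:
        counts.update(term for _, _, term, _ in find_tags(text, dictionary))
    return counts
```

Running `tag_usage` over six months of your own documents would show how small a working dictionary could be. Writing the matches back into a document as RDFa is a separate step that this sketch does not attempt.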

With very little effort on the part of users, you can transform your documents from unstructured text to tagged (and proofed) text.

Assemble at the Apache OpenOffice (or LibreOffice) projects if an easy-to-use, easy-to-modify tagging system for office suite software appeals to you.

For other software projects supporting ODF, see: OpenDocument software.

PS: Work is currently underway at the ODF TC (OASIS) on robust change tracking support. All we are missing is you.

Windows 10: Steady as you go

Friday, July 31st, 2015

Windows 10: You might be wise to wait before upgrading by Graham Cluley.

If Windows 10 isn’t your first Windows rodeo, you know the reasons for Graham’s advice on waiting a while to upgrade to Windows 10.

For example, Microsoft delivers a massive Windows 10 patch to fix early bugs by Jamie Hinks.

Doesn’t hurt to let someone else debug the early version. 😉

Who Is Tipping Scales to Cyber Attackers?

Thursday, July 30th, 2015

You don’t have to read very far into Scott Gainey’s The Economics of Cybersecurity – Are Scales Tipped to the Attacker? to get the impression that Scott accepts cyberinsecurity as a default state of affairs.

From the post:

An argument can certainly be made that the economics of cybersecurity largely favor the attacker. While the takedown of Darkode was a win for the good guys, at least temporarily, the unfortunate reality is there remains a multitude of other underground forums where criminals can gain easy access to the tools and technical support needed to organize and execute an attack. A simple search can get you quick access to virtually any tool needed for the job. Our role as executives and security professionals is to make sure these adversaries roaming these virtual havens of nastiness have to spend an inordinate amount of resources to try and achieve their objectives.

Many organizations are working to tip the scales back in their favor through a more integrated approach to security that not only includes increased spending and coordination across technology use and deployment; but also are looking at how they can improve overall efficacy through improved people training and policy management. These changes obviously come at a cost.

Many organizations are asking the natural question – how much do I really need to spend on security in order to tip the scales in my favor? In order to answer that question you must first quantify the impact and risk of a cyber attack.

The current economics of software and hardware creation shift the burden of security defects to the end user. That’s why the questions posed in Scott’s post are by users trying to tip the security scales back into their favor.

That starts the discussion in the wrong place. Users are to address security issues with more software produced by the same processes that put them at risk? Can you see why that does not fill me with confidence?

Moreover, fixing cybersecurity issues with software, at the source of its creation, places the cost of that fix on the person/s best able to make the repair. Which in turn saves thousands of other users the cost of defending against that particular cyberrisk.

In the short run, we will all have to battle cyberinsecurity but let’s also take names and assign responsibility for the defects that we do encounter.

The Islamic State/Social Media – US Military/Popular Entertainment

Wednesday, July 29th, 2015

For all of the government sponsored hysteria over the use of social media by the Islamic State, there has been no, repeat no hard evidence of its being “successful.”

Or at least not by any rational definition of “successful.” OK, so one or two impressionable teens in the UK attempt to join the Islamic State. How is that level of threat even marketable?

The situation is even worse in the United States where the FBI badgers emotionally unstable people into saying they want to go help the Islamic State and then arrests them for attempting to provide material assistance. That’s a real stretch.

Why not compare the “success” of the US military in using popular entertainment to carry its message versus that of the Islamic State using social media?

Tom Secker has recently produced a treasure trove of documents on the influence of the US military on popular entertainment, especially television shows.

From his post:

In the biggest public release of documents from the DOD’s propaganda office I recently received over 1500 pages of new material. Just under 1400 pages come from the US Army’s Entertainment Liaison Office: regular activity reports covering January 2010 to April 2015. Another over 100 pages of reports come from the US Air Force’s office, covering 2013.

The request I filed asked for all such reports since the last release covering 2005-2006 (Army documents here, Air Force documents here) but due a variety of excuses the release was limited to these 1500 pages. The Air Force said that no documents were available from 2015 ‘due to ongoing computer outages in the Los Angeles office’. If you believe that then you’ll believe anything. While the Army documents are fully digitised and easily searchable, the Air Force ones are mid-resolution scans of printed out emails/online network files.

Meanwhile, the documents between 2006-2010 (Army) and 2006-2013 (Air Force) appear to have been destroyed in keeping with the file retention policy. We already knew that the DHS only retains documents from their entertainment office for six years before shredding them like a CREEP finance report. For the military it appears to be even less than that, though given the absurd excuse offered by the Air Force it is possible they have a lot more records than they are admitting to having.

Nonetheless, this is the largest and most up-to-date release of documents from the world of Entertainment Liaison Offices. It substantially increases our knowledge of the scale and type of involvement the US military has in popular entertainment, particularly TV shows. However, details of changes to films and TV shows requested by the DOD in exchange for their co-operation are conspicuously absent (no surprises there).

Given the disparity in the size and scope of the United States versus Islamic State media campaigns, it appears the United States cannot tolerate any challenge to its world-view.

As a lifelong US citizen, I don’t find that surprising (disappointing but not surprising) at all.

PS: Be sure to check out the documents that Tom has obtained!

Unix™ for Poets

Wednesday, July 29th, 2015

Unix™ for Poets by Kenneth Ward Church.

A very delightful take on using basic Unix tools for text processing.

Exercises cover:

1. Count words in a text

2. Sort a list of words in various ways

  • ascii order
  • dictionary order
  • ‘‘rhyming’’ order

3. Extract useful info from a dictionary

4. Compute ngram statistics

5. Make a Concordance

Fifty-three (53) pages of pure Unix joy!
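For readers who want to try a few of the exercises without a Unix shell, here is a rough Python analogue of exercises 1, 2 and 4. The tutorial itself works with standard Unix tools, so this is not Church's method; the function names are my own, and I am assuming that "rhyming" order means sorting words by their reversed spelling.

```python
import re
from collections import Counter


def words(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(text):
    """Exercise 1: count how often each word occurs."""
    return Counter(words(text))


def bigram_counts(text):
    """Exercise 4: count adjacent word pairs (ngrams with n = 2)."""
    tokens = words(text)
    return Counter(zip(tokens, tokens[1:]))


def rhyming_order(tokens):
    """Exercise 2: sort distinct words by reversed spelling so endings group together."""
    return sorted(set(tokens), key=lambda w: w[::-1])
```

For example, `word_counts("the cat sat on the mat").most_common(1)` gives the most frequent word with its count, and `rhyming_order` places "cat" next to "mat".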


Text Processing in R

Wednesday, July 29th, 2015

Text Processing in R by Matthew James Denny.

From the webpage:

This tutorial goes over some basic concepts and commands for text processing in R. R is not the only way to process text, nor is it really the best way. Python is the de-facto programming language for processing text, with a lot of builtin functionality that makes it easy to use, and pretty fast, as well as a number of very mature and full featured packages such as NLTK and textblob. Basic shell scripting can also be many orders of magnitude faster for processing extremely large text corpora — for a classic reference see Unix for Poets. Yet there are good reasons to want to use R for text processing, namely that we can do it, and that we can fit it in with the rest of our analyses. I primarily make use of the stringr package for the following tutorial, so you will want to install it:

Perhaps not the best tool for text processing but if you are inside R and have text processing needs, this will get you started.

The Declining Half-Life of Secrets

Wednesday, July 29th, 2015

The Declining Half-Life of Secrets And the Future of Signals Intelligence by Peter Swire.

Peter Swire writes:

The nature of secrets is changing. The “half-life of secrets” is declining sharply for many signals intelligence and other intelligence activities as secrets that may have been kept successfully for 25 years or more are exposed well before.

For evidence, one need look no further than the 2015 breach at the Office of Personnel Management (OPM), of personnel records for 22 million U.S. government employees and family members. For spy agencies, theft of the security clearance records is uniquely painful – whoever gains access to the breached files will have an unparalleled ability to profile individuals in the intelligence community and subject them to identity theft.

OPM is just one instance in a long string of high-profile breaches, where hackers gain access to personal information, trade secrets, or classified government material. The focus of the discussion here, though, is on complementary trends in information technology, including the continuing effects of Moore’s Law, the sociology of the information technology community, and changed sources and methods for signals intelligence. This article is about those risks of discovery and how the intelligence community must respond.

My views on this subject were formed during my experience as one of five members of President Obama’s Review Group on Intelligence and Communications Technology in 2013. There is a crucial difference between learning about a wiretap on the German Chancellor from three decades ago and learning that a wiretap has targeted the Current German Chancellor, Angela Merkel, while she is still in office and able to object effectively. In government circles, this alertness to negative consequences is sometimes called “the front-page test,” which describes how our actions will look if they appear on the front page of the newspaper. The front-page test becomes far more important to decision-makers when secrets become known sooner. Even if the secret operation is initially successful, the expected costs of disclosure become higher as the average time to disclosure decreases.

Peter generously attributes secrecy in the intelligence community to fear of “mosaic theory,” that is that an opponent may be gathering any and all information in an effort to indirectly discover what it cannot discover directly.

While application of “mosaic theory” to an intelligence agency isn’t impossible, revelations from the Pentagon Papers to date have shown criminal misconduct, concealing incompetence, career protection, and a host of other unsavory motives are at least as likely as application of “mosaic theory.”

The intelligence community should recognize secrecy for the sake of concealing criminal misconduct, incompetence, career protection, etc., weakens its claim for protection of legitimate secrets.

Only the intelligence community can clean its own house. The alternative is random disclosure of secrets of varying importance.

This is the first paper of the Cybersecurity Initiative.

From their about page:

There is perhaps no issue that has grown more important, more rapidly, on so many different levels, than cybersecurity. It affects personal privacy, business prosperity and the wider economy, as well as national security and international relations. It is a field that matters for everything from human rights and corporate profits to fundamental issues of war and peace. And with the rapid growth in both the number of people and devices coming online across the globe, the security of information systems is only going to grow in importance. Yet, while ever more amounts are spent each year, our collective understanding of the problem remains immature and both public policy and private sector efforts have failed to match the scale or complexity of this challenge for us all. The Internet has connected us, but the policies and debates that surround the security of our networks are too often disconnected, disjoint, and stuck in an unsuccessful status quo.

This is what New America’s Cybersecurity Initiative is designed to address. We believe that it takes a wider network to face the network of diverse security issues. Success in this endeavor will require collaboration – across organizations, issue areas, professional fields and business sectors, as well as local, state, and international borders. By highlighting bold new ideas, bringing in new voices with fresh perspectives, breaking down issue and organizational barriers while building up a new field of study, encouraging new research approaches to the next generation of cybersecurity issues, connecting and creating new constituencies, and providing vibrant media and policy platforms to support that creativity, we can aid in pushing forward the cyber policy needed right now and better set us up for success tomorrow.

Prior attempts at cybersecurity have failed. (full stop) Can you guess what the outcome will be from repeating old ideas?

Follow the Cybersecurity Initiative if you want new ideas.

The next Web standard could be music notation

Tuesday, July 28th, 2015

The next Web standard could be music notation by Peter Kirn.

From the post:

The role of the music score is an important one, as a lingua franca – it puts musical information in a format a lot of people can read. And it does that by adhering to standards.

Now with computers, phones, and tablets all over the planet, can music notation adapt?

A new group is working on bringing digital notation as a standard to the Web. The World Wide Web Consortium (W3C) – yes, the folks who bring you other Web standards – formed what they’re describing as a “community group” to work on notation.

That doesn’t mean your next Chrome build will give you lead sheets. W3C are hosting, not endorsing the project – not yet. And there’s a lot of work to be done. But many of the necessary players are onboard, which could mean some musically useful progress.

The news arrived in my inbox by way of Hamburg-based Steinberg. That’s no surprise; we knew back in 2013 that the core team behind Sibelius had arrived at Steinberg after a reorganization at Avid pushed them out of the company they original started.

The other big player in the group is MakeMusic, developers of Finale. And they’re not mincing words: they’re transferring the ownership of the MusicXML interchange format to the new, open group:
MakeMusic Transfers MusicXML Development to W3C

The next step: make notation work on the Web. Sibelius were, while not the first to put notation on the Web, the first to popularize online sharing as a headline feature in a mainstream notation tool. Sibelius even had a platform for sharing and selling scores, complete with music playback. But that was dependent on a proprietary plug-in – now, the browser is finally catching up, and we can do all of the things Scorch does right in browser.

So, it’s time for an open standard. And the basic foundation already exists. The new W3C Music Notation Community Group promises to “maintain and update” two existing standards – MusicXML and the awkwardly-acronym’ed SMuFL (Standard Music Font Layout). Smuffle sounds like a Muppet from Sesame Street, but okay.

For the W3C group:

Music notation has a long history across cultures. It will be interesting to see what subset of music notation is captured by this effort.

Moving FASTR in the US Senate

Tuesday, July 28th, 2015

Moving FASTR in the US Senate by Peter Suber.

From the post:

FASTR will go to markup tomorrow in the Senate Homeland Security and Governmental Affairs Committee (HSGAC).

Here’s a recap of my recent call-to-action post on FASTR, with some new details and background.

FASTR is the strongest bill ever introduced in Congress requiring open access to federally-funded research.

We already have the 2008 NIH policy, but it only covers one agency. We already have the 2013 Obama directive requiring about two dozen federal agencies to adopt OA mandates, but the next President could rescind it.

FASTR would subsume and extend the NIH policy. FASTR would solidify the Obama directive by grounding these agency policies in legislation. Moreover, FASTR would strengthen the NIH policy and Obama directive by requiring reuse rights or open licensing. It has bipartisan support in both the House and the Senate.

FASTR has been introduced in two sessions of Congress (February 2013 and March 2015), and its predecessor, FRPAA (Federal Research Public Access Act), was introduced in three (May 2006, April 2010, February 2012). Neither FASTR nor FRPAA has gotten to the stage of markup and a committee vote. That’s why tomorrow’s markup is so big.

For the reasons why FASTR is stronger than the Obama directive, see my 2013 article comparing the two.

For steps you can take to support FASTR, see the action pages from the Electronic Frontier Foundation (EFF) and Scholarly Publishing and Academic Resources Coalition (SPARC).

Even though I will be in a day-long teleconference tomorrow, I will be contacting my Senators to support FASTR.

How about you?

IoT Pinger (Wandora)

Tuesday, July 28th, 2015

IoT Pinger (Wandora)

From the webpage:

This is an upcoming feature and is not included yet in the public release.

The IoT (Internet of Things) pinger is a general purpose API consumer intended to aggregate data from several different sources providing data via HTTP. The IoT Panel is found in the Wandora menu bar and presents most of the pinger’s configuration options. The Pinger searches the current Topic Map for topics with an occurrence with Source Occurrence Type. Those topics are expected to correspond to an API endpoint defined by corresponding occurrence data. The pinger queries each endpoint every specified time interval and saves the response as an occurrence with Target Occurrence Type. The pinger process can be configured to stop at a set time using the Expires toggle. Save on tick saves the current Topic Map in the specified folder after each tick of the pinger in the form iot_yyyy_mm_dd_hh_mm_ss.jtm.

Now there’s an interesting idea!

Looking forward to the next release!
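Until the feature ships, the behavior described on the webpage can be approximated in a few lines of Python. This is a sketch of the description only, not Wandora's implementation: the names `snapshot_name`, `tick` and `run` are hypothetical, a plain dict stands in for topics carrying a source occurrence, and `fetch` is whatever HTTP client function you choose to pass in.

```python
import time
from datetime import datetime


def snapshot_name(moment):
    """File name in the iot_yyyy_mm_dd_hh_mm_ss.jtm form quoted above."""
    return moment.strftime("iot_%Y_%m_%d_%H_%M_%S.jtm")


def tick(endpoints, fetch, now):
    """Query every endpoint once.

    endpoints: mapping of topic name -> URL (stand-in for source occurrences).
    fetch: callable that takes a URL and returns the response body as text.
    Returns the responses keyed by topic (stand-in for target occurrences)
    and the name a "save on tick" snapshot would get.
    """
    occurrences = {topic: fetch(url) for topic, url in endpoints.items()}
    return occurrences, snapshot_name(now)


def run(endpoints, fetch, interval_seconds, expires,
        clock=datetime.now, sleep=time.sleep):
    """Tick at a fixed interval until `expires`; return the snapshot names produced."""
    names = []
    now = clock()
    while now < expires:
        _, name = tick(endpoints, fetch, now)
        names.append(name)
        sleep(interval_seconds)
        now = clock()
    return names
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting on real time. Actually writing the topic map to disk is left out.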

Big Data to Knowledge (Biomedical)

Tuesday, July 28th, 2015

Big Data to Knowledge (BD2K) Development of Software Tools and Methods for Biomedical Big Data in Targeted Areas of High Need (U01).


Open Date (Earliest Submission Date) September 6, 2015

Letter of Intent Due Date(s) September 6, 2015

Application Due Date(s) October 6, 2015,

Scientific Merit Review February 2016

Advisory Council Review May 2016

Earliest Start Date July 2016

From the webpage:

The purpose of this BD2K Funding Opportunity Announcement (FOA) is to solicit development of software tools and methods in the three topic areas of Data Privacy, Data Repurposing, and Applying Metadata, all as part of the overall BD2K initiative. While this FOA is intended to foster new development, submissions consisting of significant adaptations of existing methods and software are also invited.

The instructions say to submit early so that corrections to your application can be suggested. (Take the advice.)

Topic maps, particularly with customized subject identity rules, are a nice fit to the detailed requirements you will find at the grant site.

Ping me if you are interested in discussing why you should include topic maps in your application.

Android Phones: Precursor to the Internet of Things (IoT)

Tuesday, July 28th, 2015

Graham Cluley does a great job in: Gaping hole in Android lets hackers break in with just your phone number! summarizing how easily your Android phone can be breached. Requirement? Knowing your phone number.

There is a lot of talk about the need for security for the Internet of Things (IoT) but talk isn’t going to keep you secure on the IoT.

Think about it for a moment, the same type of folks that brought you Three Mile Island, exploding Ford Pintos, Chernobyl, Bhopal, Fukushima Daiichi, and the myriad product recalls that litter the daily news, they are going to provide for your security on the IoT?

The time has come and passed for liability for software/hardware that allows breaches of security. That is the one remedy that has not been imposed on the computer industry to force the production of more secure code.

If you want your car, SUV, motorcycle, computer, TV, freezer, refrigerator, etc. to be as easy to breach as your Android Phone, say nothing.

If you want some minimal amount of privacy/security in the IoT, call for liability for software/hardware security holes now!

Free Access to Law = ‘Terrorism’

Monday, July 27th, 2015

Georgia claims that publishing its state laws for free online is ‘terrorism’ by Michael Hiltzik.

There isn’t much in the way of government stupidity that surprises me but this caught my eye.

The gist of the case is that the State of Georgia is suing Carl Malamud for making the text of Georgia laws, annotated at the State’s expense, available free to the public.

If you want to fight this type of government idiocy, donate to Public.Resource.Org.

Learning Data Science Using Functional Python

Sunday, July 26th, 2015

Learning Data Science Using Functional Python by Joel Grus.

Something fun to start the week off!

Apologies for the “lite” posting of late. I am munging some small but very ugly data for a report this coming week. The data sources range from spreadsheets to forms delivered in PDF, in no particular order and some without the original numbering. What fun!

Complaints about updating URLs that were redirects were met with replies that “private redirects” weren’t of interest and they would continue to use the original URLs. Something tells me the responsible parties didn’t quite get what URL redirects are about.

Another day or so and I will be back at full force with more background on the Balisage presentation and more useful posts every day.

Black Hat USA 2015

Saturday, July 25th, 2015

I know you have already registered, etc. but there is one presentation I hope you catch and blog about:


TrackingPoint is an Austin startup known for making precision-guided firearms. These firearms ship with a tightly integrated system coupling a rifle, an ARM-powered scope running a modified version of Linux, and a linked trigger mechanism. The scope can follow targets, calculate ballistics and drastically increase its user’s first shot accuracy. The scope can also record video and audio, as well as stream video to other devices using its own wireless network and mobile applications.

In this talk, we will demonstrate how the TrackingPoint long range tactical rifle works. We will discuss how we reverse engineered the scope, the firmware, and three of TrackingPoint’s mobile applications. We will discuss different use cases and attack surfaces. We will also discuss the security and privacy implications of network-connected firearms.

TrackingPoint should get security points for not basing their product on Windows XP.


Lessons learned on Linux-based systems should be applicable to weaker operating systems as well.

Enjoy the conference!

Sora high performance software radio is now open source

Saturday, July 25th, 2015

Sora high performance software radio is now open source by Jane Ma.

From the post:

Microsoft researchers today announced that their high-performance software radio project is now open sourced through GitHub. The goal for Microsoft Research Software Radio (Sora) is to develop the most advanced software radio possible, capable of implementing the latest wireless communication technology easily and efficiently.

"We believe that a fully open source Sora will better support the research community on more scientific innovation," said Kun Tan, a senior research on the software radio project team.

Conventionally, the critical lower layer processing in wireless communication systems, i.e., the physical layer (PHY) and medium access control (MAC), are typically implemented in hardware (ASIC chips), due to high-computational and real-time requirements. However, designing ASIC is very costly and inflexible since ASIC chips are fixed. Once delivered, it cannot be changed or upgraded. The lack of flexibility and programmability makes experimental research in wireless communication very difficult. Software Radio (or SDR), on the contrary, proposes implementing all these low-level PHY and MAC processes through software, which is practical for development, debugging and updating. The challenge, however, is how the software can stay up to date with hardware in terms of performance.

See also: Microsoft's Wireless and Networking research group

Sora was developed to solve this significant challenge. Sora is a fully programmable high-performance software radio that is capable of implementing state-of-the-art wireless technologies (Wi-Fi, LTE, MIMO, etc.). Sora is based on software running on a low-cost, commodity multi-core PC with a general purpose OS, i.e., Windows. A multi-core PC, plugged in to a PCIe radio control board, connecting to a third-party radio front-end with antenna, becomes a powerful software radio platform. The PC interface board transfers the raw wireless (I/Q) signals between the RF front-end and the PC memory through fast DMA. All signals are processed in the software running in the PC.

An avalanche of wireless signals will accompany the Internet of Things (IoT). Intercepting all of them with custom hardware would be prohibitively expensive.

Thanks to Microsoft, you can skip the custom hardware step.

Remember: The question is who is listening, not if anyone is listening.

Programming Languages Used for Music

Friday, July 24th, 2015

Programming Languages Used for Music by Tim Thompson.

From the history page:

The PLUM (Programming Languages Used for Music) list is maintained as a service for those who, like me, are interested in programming languages that are used for a musical purpose. The initial content was based on a list that Carter Scholz compiled and posted to netnews in 1991.

There are a wide variety of languages used for music. Some are conventional programming languages that have been enhanced with libraries and environments for use in musical applications. Others are specialized languages developed explicitly for music.

The focus of entries in this list is on the languages, and not on particular applications. In other words, a musical application written in a particular programming language does not immediately qualify for inclusion in the list, unless the application is specifically intended to enhance the use of that language for musical development by other programmers.

Special thanks go to people who have provided significant comments and information for this list: Bill Schottstaedt, Dave Phillips, and Carter Scholz.

Corrections to existing entries and suggestions for improving the list should be mailed to the PLUM maintainer:

Tim Thompson
Home Page:

If you are experimenting with Clojure and music, these prior efforts may be inspirational.


How to Spot an Extremist

Friday, July 24th, 2015


Cameron unveils plan to tackle extremism in UK

How you look is a matter of heredity so I probably should not say that David Cameron “looks like” an extremist. (Even if he does.)

Why resort to appearances when Cameron easily convinces any rational listener that he is an extremist, both in print and on radio?

Consider what Cameron says awaits young people who join the Islamic State:

“If you are a boy, they will brainwash you, strap bombs to your body and blow you up. If you are a girl, they will enslave and abuse you. That is the sick and brutal reality of ISIL.

What Cameron does not say is that if you stay in the UK:

If you are a boy, they will brainwash you and have you kill innocent women and children with highly sophisticated drones. You will inflict harm on people who wish your country would leave them alone. Your efforts will support Cameron and his trainers playing out the Game of Thrones in the Middle East.

If you are a girl, you will be enslaved by a consumer culture built on making you feel insecure and frightened, you will be sexually harassed both in school and at work. You will eventually appreciate being a second-class citizen in a vassal state with poor cooking.

That is the sick and brutal reality of the UK.

No, I’m still not a supporter of the Islamic State but I do see David Cameron as a deluded extremist.

Left to their own devices, the people of the Middle East are capable of choosing or deposing whatever governments they want. Something that the West is unwilling to allow to happen. Do you wonder why?

Exploring the Enron Spreadsheet/Email Archive

Thursday, July 23rd, 2015

I forgot to say yesterday that if you cite Felienne Hermans and Emerson Murphy-Hill’s Enron archive, use this citation:

  author    = {Felienne Hermans and
               Emerson Murphy-Hill},
  title     = {Enron's Spreadsheets and Related Emails: A Dataset and Analysis},
  booktitle = {37th International Conference on Software Engineering, {ICSE} '15},
  note      = {to appear}

A couple of interesting tidbits from this morning.

Non-Matching Spreadsheet Names

If you look at:

(local)/84_JUDY_TOWNSEND_000_1_1.PST/townsend-j/JTOWNSE (Non-Privileged)/Inbox/_1687004514.eml

You will find that (sender) sent an email with Tport Max Rates Calculations 10-27-01.xls attached to fletcjv@NU.COM and cc:ed “Concannon” and “Townsend”. (Potential subjects in bold.)

I selected this completely at random, save for finding an email that used the word “spreadsheet.”

If you look in the spreadsheet archive, you will not find “Tport Max Rates Calculations 10-27-01.xls,” at least not by that name. You will find: “judy_townsend__17745__Tport Max Rates Calculations 10-27-01.xlsx.”

I don’t know when that conversion took place but thought it was worth noting. BTW, the spreadsheet archive has 15,871 .xlsx files and 58 .xls files. Michelle Lokay has thirty-two of the fifty-eight (58) .xls files but they all appear to be duplicated by files with the .xlsx extension.

Given the small number, I suspect an anomaly in a bulk conversion process. When I do group operations on the spreadsheets I will be using the .xlsx extension only to avoid duplicates.
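A minimal Python sketch of that check, assuming the spreadsheet archive has been unpacked under a single directory. The function name is hypothetical, and "duplicate" here only means the same directory and the same file name apart from the extension. It does not compare file contents.

```python
from collections import Counter
from pathlib import Path


def extension_report(root):
    """Count files by extension and list .xls files that also exist as .xlsx."""
    paths = [p for p in Path(root).rglob("*") if p.is_file()]
    counts = Counter(p.suffix.lower() for p in paths)

    # Index every .xlsx file by (directory, name without extension).
    xlsx_keys = {(p.parent, p.stem) for p in paths if p.suffix.lower() == ".xlsx"}

    # An .xls file counts as duplicated when a same-named .xlsx sits beside it.
    duplicated = sorted(
        p for p in paths
        if p.suffix.lower() == ".xls" and (p.parent, p.stem) in xlsx_keys
    )
    return counts, duplicated
```

Calling extension_report on the unpacked directory returns the per-extension counts plus the list of .xls files that have a matching .xlsx. If every .xls file appears in that list, working with the .xlsx files alone loses nothing by name.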

Dirty, Very Dirty Data

I was just randomly opening spreadsheets when I encountered this jewel:


Using rows to format column headers. There are worse examples, try:


No column headers at all! (On this tab.)

I am beginning to suspect that the conversion to .xlsx format was to enable the use of better tooling to explore the original .xls files.

Be sure to register for Balisage 2015 if you want to see the outcome of all this running around!

Tomorrow I think we are going to have a conversation about indexing email with Solr. Having all 15K spreadsheets doesn’t tell me which ones were spoken of the most often in email.
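Short of a full index, a rough first answer can come from counting attachment names. The sketch below is a stand-in that uses only the Python standard library, not Solr. It assumes the emails are .eml files somewhere under one directory, and the function name and default extensions are illustrative choices.

```python
from collections import Counter
from email import policy
from email.parser import BytesParser
from pathlib import Path


def count_attachments(eml_root, extensions=(".xls", ".xlsx")):
    """Count how often each spreadsheet file name appears as an email attachment."""
    counts = Counter()
    for eml_path in Path(eml_root).rglob("*.eml"):
        with open(eml_path, "rb") as handle:
            message = BytesParser(policy=policy.default).parse(handle)
        # walk() visits every MIME part; parts without a file name are skipped.
        for part in message.walk():
            name = part.get_filename()
            if name and name.lower().endswith(extensions):
                counts[name] += 1
    return counts
```

count_attachments(root).most_common(20) would then list the most frequently attached names. Those are the names as they appear in the emails, so they still have to be matched against the prefixed names used in the spreadsheet archive.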

Enron, Spreadsheets and 7z

Wednesday, July 22nd, 2015

Sam Hunting and I are working on a presentation for Balisage that involves a subset of the Enron dataset focused on spreadsheets.

You will have to attend Balisage to see the floor show but I will be posting notes about our preparations for the demo under the category Enron and/or Spreadsheets.

Origin of the Enron dataset on Spreadsheets

First things first, the subset of the Enron dataset focused on spreadsheets was announced by Felienne Hermans in A modern day Pompeii: Spreadsheets at Enron.

The data set: Hermans, Felienne (2014): Enron Spreadsheets and Emails. figshare.

Felienne has numerous presentations and publications on spreadsheets and issues with spreadsheets.

I have always thought of spreadsheets as duller versions of tables.

Felienne, on the other hand, has found intrigue, fraud, error, misunderstanding, opacity, and the usual chicanery of modern business practice.

Whether you want to “understand” a spreadsheet depends on whether you need plausible deniability or if you are trying to detect efforts at plausible deniability. Auditors for example.

Felienne’s Enron spreadsheet data set is a great starting point for investigating spreadsheets and their issues.

Unpacking the Archives with 7z

The email archive comes in thirteen separate files, eml.7z.001 – eml.7z.013.

At first I tried to use 7z to assemble the archive, decompress it and grep the results without writing it out. No go.

On a subsequent attempt, just unpacking the multi-part file, a message appeared announcing a name conflict and asking what to do with the conflict.

IMPORTANT POINT: Thinking I don’t want to lose any data, I foolishly said to rename files to avoid naming conflicts.

You are probably laughing at this point because you can see where this is going.

The command I used to first extract the files reads: 7z e eml.7z.001 (remembering that in the case of name conflicts I said to rename the conflicting file).

But if you use 7z e, all the files are written to a single directory. Which of course means for every single file write, it has to check for conflicting file names. Oops!
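To see why flattening produces conflicts at all, here is a minimal Python sketch, not part of the original workflow. Given a list of archive member paths, however obtained, it reports which base names would collide once the directories are dropped. The function name and the sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def flattened_collisions(member_paths):
    """Return base names that several members would share in a single directory."""
    names = Counter(PurePosixPath(path).name for path in member_paths)
    return {name: count for name, count in names.items() if count > 1}


# Hypothetical member paths: two mailboxes that both contain a "1.eml".
sample = [
    "townsend-j/Inbox/1.eml",
    "lokay-m/Inbox/1.eml",
    "lokay-m/Sent/2.eml",
]
print(flattened_collisions(sample))
```

With full paths preserved, the two 1.eml files land in different directories and never meet. Flattened, each must be checked against everything already written.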

After more than twenty-four (24) hours of ever slowing output (# of files was at 528,000, approximately), I killed the process and took another path.

I used 7z x eml.7z.001 (correct command), which restores all of the original directories and therefore there are no file name conflicts. File writing I/O jumped up to 20MB/sec+, etc.

Still took seventy-eight (78) minutes to extract but there were other heavy processes going on at the same time.

Like deleting the 528K+ files in the original unpacked directory. Did you know that rm has an argument limit? I’m sure you won’t encounter it often but it can be a real pain when you do. I was deleting all the now unwanted files from the first run when I encountered it.

It is not specific to rm: the expanded argument list handed to any command is capped (see: Argument List Too Long). The cap is 128K, to give you an idea of the number of file names you need before hitting this issue.
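One way around the limit is never to build a long argument list in the first place. The Python sketch below is one option among several, with a hypothetical function name. It removes the regular files directly inside one directory, one unlink call at a time, and leaves subdirectories alone.

```python
import os


def delete_files_in_directory(directory):
    """Delete every regular file directly inside `directory`; return the count."""
    # Collect the paths first so nothing is deleted while the scan is open.
    with os.scandir(directory) as entries:
        targets = [
            entry.path for entry in entries
            if entry.is_file(follow_symlinks=False)
        ]
    removed = 0
    for path in targets:
        os.unlink(path)  # one system call per file, no argument list involved
        removed += 1
    return removed
```

From the shell, find with its -delete action avoids the same problem, since find removes each file itself rather than passing the names to rm.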

The Lesson

Unpack the Enron email archive with: 7z x eml.7z.001.

Tomorrow I will be posting about using Unix shell tools to explore the email data.

PS: Register for Balisage today!

Road Rage with Flair!

Wednesday, July 22nd, 2015

Zero-day in Fiat Chrysler feature allows remote control of vehicles by Robert Abel.

From the post:

Fiat Chrysler owners should update their vehicles’ software after a pair of security researchers were able to exploit a zero-day vulnerability to remotely control the vehicle’s the engine, transmission, wheels and brakes among other systems.

Chris Valasek, director of vehicle security at IOActive, and security researcher Charlie Miller, a member of the company’s advisory board, said the vulnerability was found in late 2013 to 2015 models that have the Uconnect feature, according to Wired.

Anyone who knows who knows the car’s IP address may gain access to a vulnerable vehicle through its cellular connection. Attackers can then target a chip in the vehicle’s entertainment hardware unit to rewrite its firmware to send commands to internal computer networks controlling physical components.

If that sounds bad, you really need to read the Wired article Hackers Remotely Kill a Jeep on the Highway—With Me in It by Andy Greenberg.

Here’s a paragraph from the Wired article to get you hooked:

Though I hadn’t touched the dashboard, the vents in the Jeep Cherokee started blasting cold air at the maximum setting, chilling the sweat on my back through the in-seat climate control system. Next the radio switched to the local hip hop station and began blaring Skee-lo at full volume. I spun the control knob left and hit the power button, to no avail. Then the windshield wipers turned on, and wiper fluid blurred the glass.

You won’t be disappointed because the hack continues onto the transmission, brakes, steering (not perfected, yet) and other systems.

Hard to say when this will appear as a routine download with a nice GUI. Perhaps with automatic display of prospective targets within visual range.

The upside is a resurgence of interest in classic cars.

Your security status will be reflected in the lack of remotely controllable devices.

For the truly security conscious, secretaries may replace voice dictation systems on vulnerable networks.

Ibis on Impala: Python at Scale for Data Science

Tuesday, July 21st, 2015

Ibis on Impala: Python at Scale for Data Science by Marcel Kornacker and Wes McKinney.

From the post:

Ibis: Same Great Python Ecosystem at Hadoop Scale

Co-founded by the respective architects of the Python pandas toolkit and Impala and now incubating in Cloudera Labs, Ibis is a new data analysis framework with the goal of enabling advanced data analysis on a 100% Python stack with full-fidelity data. With Ibis, for the first time, developers and data scientists will be able to utilize the last 15 years of advances in high-performance Python tools and infrastructure in a Hadoop-scale environment—without compromising user experience for performance. It’s exactly the same Python you know and love, only at scale!

In this initial (unsupported) Cloudera Labs release, Ibis offers comprehensive support for the analytical capabilities presently provided by Impala, enabling Python users to run Big Data workloads in a manner similar to that of “small data” tools like pandas. Next, we’ll extend Impala and Ibis in several ways to make the Python ecosystem a seamless part of the stack:

  • First, Ibis will enable more natural data modeling by leveraging Impala’s upcoming support for nested types (expected by end of 2015).
  • Second, we’ll add support for Python user-defined logic so that Ibis will integrate with the existing Python data ecosystem—enabling custom Python functions at scale.
  • Finally, we’ll accelerate performance further through low-level integrations between Ibis and Impala with a new Python-friendly, in-memory columnar format and Python-to-LLVM code generation. These updates will accelerate Python to run at native hardware speed.

See: Getting Started with Ibis and How to Contribute (same authors, opposite order) in order to cut to the chase and get started.


Migrate or Lose Control of Your Windows XP/Server 2003 System

Tuesday, July 21st, 2015

Microsoft words it:

Vulnerability in Microsoft Font Driver Could Allow Remote Code Execution (3079904).

But later makes the danger a little clearer:

A remote code execution vulnerability exists in Microsoft Windows when the Windows Adobe Type Manager Library improperly handles specially crafted OpenType fonts. An attacker who successfully exploited this vulnerability could take complete control of the affected system. An attacker could then install programs; view, change, or delete data; or create new accounts with full user rights.

There are multiple ways an attacker could exploit this vulnerability, such as by convincing a user to open a specially crafted document, or by convincing a user to visit an untrusted webpage that contains embedded OpenType fonts. The update addresses the vulnerability by correcting how the Windows Adobe Type Manager Library handles OpenType fonts.

When this security bulletin was issued, Microsoft had information to indicate that this vulnerability was public but did not have any information to indicate this vulnerability had been used to attack customers. Our analysis has shown that exploit code could be created in such a way that an attacker could consistently exploit this vulnerability. (emphasis added)

Show of hands: How many of you visit untrusted sites with embedded OpenType fonts?

Microsoft rates this as critical for all versions of Windows.

No patch has been issued for Windows XP or Windows Server 2003.

Get Smarter About Apache Spark

Tuesday, July 21st, 2015

Get Smarter About Apache Spark by Luis Arellano.

From the post:

We often forget how new Spark is. While it was invented much earlier, Apache Spark only became a top-level Apache project in February 2014 (generally indicating it’s ready for anyone to use), which is just 18 months ago. I might have a toothbrush that is older than Apache Spark!

Since then, Spark has generated tremendous interest because the new data processing platforms scales so well, is high performance (up to 100 times faster than alternatives), and is more flexible than other alternatives, both open source and commercial. (If you’re interested, see the trends on both Google searches and Indeed job postings.)

Spark gives the Data Scientist, Business Analyst, and Developer a new platform to manage data and build services as it provides the ability to compute in real-time via in-memory processing. The project is extremely active with ongoing development, and has serious investment from IBM and key players in Silicon Valley.

Luis has collected up links for absolute beginners, understanding the basics, intermediate learning and finally reaching the expert level.

None of the lists are overwhelming so give them a try.

Top Ten #ddj

Tuesday, July 21st, 2015

Top Ten #ddj: The Week’s Most Popular Data Journalism Links

From the post:

What’s the data-driven journalism crowd tweeting? Here are the top ten links for July 9 to 16: +300 sites with free geographic datasets (@sciremotesense); graphing German YouTube (@SPIEGELOLINE); Australia’s mining footprint (@ICIJorg); democratizing data (OKFN); and more.


Occupy London As Terrorists?

Tuesday, July 21st, 2015

The City of London is so bereft of terrorists or terrorist-like activity that Occupy London has been drafted as terrorists.

Not having any credible terrorist threat could certainly endanger police and security service budgets so it isn’t hard to understand why the London Police acted quickly in designating Occupy London as a terrorist group.

Although, a “terrorist group” that obeys court injunctions as to where they can assemble doesn’t strike me as a particularly dangerous group of “terrorists.”

Appropriately, the police initiative is called Project Fawn. Whether the Fawn refers to the fawning of sycophants for funding to combat non-existent terrorism or has some other reference isn’t clear.

Charlie Hebdo is cited as an “attack,” although only twelve (12) people died, as opposed to the London average homicide rate of 171 per year since 1990.

And it’s not “terrorism” but on average seven (7) spouses die every month in London from domestic violence. Domestic violence: One month’s death toll.

Why a child hearing/seeing a parent beaten to death isn’t a priority over non-existent terrorism is difficult for a foreign observer to say.

On the whole, the real terrorists in London occupy offices in the London Police department and other security services. Something should be done to turn them out, root and branch.

Inside the Secret World of Russia’s Cold War Mapmakers

Tuesday, July 21st, 2015

Inside the Secret World of Russia’s Cold War Mapmakers by Greg Miller.

From the post:

A MILITARY HELICOPTER was on the ground when Russell Guy arrived at the helipad near Tallinn, Estonia, with a briefcase filled with $250,000 in cash. The place made him uncomfortable. It didn’t look like a military base, not exactly, but there were men who looked like soldiers standing around. With guns.

The year was 1989. The Soviet Union was falling apart, and some of its military officers were busy selling off the pieces. By the time Guy arrived at the helipad, most of the goods had already been off-loaded from the chopper and spirited away. The crates he’d come for were all that was left. As he pried the lid off one to inspect the goods, he got a powerful whiff of pine. It was a box inside a box, and the space in between was packed with juniper needles. Guy figured the guys who packed it were used to handling cargo that had to get past drug-sniffing dogs, but it wasn’t drugs he was there for.

Inside the crates were maps, thousands of them. In the top right corner of each one, printed in red, was the Russian word секрет. Secret.

The maps were part of one of the most ambitious cartographic enterprises ever undertaken. During the Cold War, the Soviet military mapped the entire world, parts of it down to the level of individual buildings. The Soviet maps of US and European cities have details that aren’t on domestic maps made around the same time, things like the precise width of roads, the load-bearing capacity of bridges, and the types of factories. They’re the kinds of things that would come in handy if you’re planning a tank invasion. Or an occupation. Things that would be virtually impossible to find out without eyes on the ground.

Given the technology of the time, the Soviet maps are incredibly accurate. Even today, the US State Department uses them (among other sources) to place international boundary lines on official government maps.

If you like stories of the intrigue of the Cold War and of maps, Greg’s post was made for you.

The maps have been rarely studied but one person is trying to change that:

But one unlikely scholar, a retired British software developer named John Davies, has been working to change that. For the past 10 years he’s been investigating the Soviet maps, especially the ones of British and American cities. He’s had some help, from a military map librarian, a retired surgeon, and a young geographer, all of whom discovered the maps independently. They’ve been trying to piece together how they were made and how, exactly, they were intended to be used. The maps are still a taboo topic in Russia today, so it’s impossible to know for sure, but what they’re finding suggests that the Soviet military maps were far more than an invasion plan. Rather, they were a framework for organizing much of what the Soviets knew about the world, almost like a mashup of Google Maps and Wikipedia, built from paper.

I don’t know any more about Soviet maps than you can gain from reading this article but the line:

they were a framework for organizing much of what the Soviets knew about the world, almost like a mashup of Google Maps and Wikipedia, built from paper.

Has some of the qualities that I associate with topic maps. Granting it chooses a geographic frame of reference but every map has some frame of reference, stated or unstated.

It would make a great paper on topic maps to represent the knowledge of an old-style Soviet map as a topic map.

As a resource, John Davies maintains a comprehensive website about Soviet maps.

Fusing Narrative with Graphs

Tuesday, July 21st, 2015

Quest for a Narrative Representation of Power Relations by Lambert Strether.

Lambert is looking to meet these requirements:

  1. Be generated algorithmically from data I control….
  2. Have narrative labels on curved arcs. … The arcs must be curved, as the arcs in Figure1 are curved, to fit the graph within the smallest possible (screen) space.
  3. Be pretty. There is an entire literature devoted to making “pretty” graph, starting with making sure arcs don’t cross each other….

The following map was hand crafted and it meets all the visual requirements:


Check out the original here.

Lambert goes on a search for tools that come close to this presentation and also meet the requirements set forth above.

The idea of combining graphs with narrative snippets as links is a deeply intriguing one. Rightly or wrongly I think of it as illustrated narrative but without the usual separation between those two elements.


A history of backdoors

Tuesday, July 21st, 2015

A history of backdoors by Matthew Green (Cryptographer and research professor at Johns Hopkins University).

I commend this review of the history of backdoors to anyone interested in the technical issues and why “exceptional access” isn’t a workable idea.

I do not recommend it if you are debating pro-government access advocates because they are wrapped in an armor of invincible ignorance. You are wasting your time offering them contrary facts and opinions of genuine experts.

Your time is better spent documenting all the lies told by government officials in a policy area. When debating “exceptional access” point out the government and its minions are completely unworthy of belief.

Humor is a far better weapon than facts because the government advocates think of themselves as serious people, entrusted with defending civilization itself. Rather than the 24×7 news cycles solemnly intoning their latest positions, there should be a Punch and Judy show that reports their positions.

And mocks their positions as well. Few things are sharper than well crafted humor. Especially when set to music so every time a dunce proposal like “Clipper” comes up, you think of:


That’s right, the Clipper chip is to you as clipping Cupid’s wings is to him. (Pierre Mignard (1610-1695) – Time Clipping Cupid’s Wings (1694) Pierre Mignard [Public domain], via Wikimedia Commons)

If you keep that firmly in mind, how much respect can you have for the Clipper chip and similar proposals?

Phishing Job Applicants [Phishing Rating Service?]

Tuesday, July 21st, 2015

Swati Khandelwal writes in: Phishing Your Employees: Clever way to promote Cyber Awareness that:

A massive 91% of successful data breaches at companies started with a social engineering and spear-phishing attack. A phishing attack usually involves an e-mail that manipulates a victim to click on a malicious link that could then expose the victim’s computer to a malicious payload.

Phish your Employees!

Yes, you heard me right… by this I mean that you should run a mock phishing campaign in your organization and find out which employees would easily fall victim to the phishing emails. Then step everyone through Internet Security Awareness Training.

Great idea but we can do better than that!

Phish your job applicants!

You can rank your current applicants by their vulnerability to phishing and in the long term, develop a phishing scale for all applicants.

Those that fail, you don’t call for an interview.

Any more than you would install a doorway into your corporate offices without a door.

Has anyone proposed a phishing rating service? Like a credit rating but it rates how likely you are to be a victim of phishing?

PS: I know your CEO and his buddies will fail the same test but the trick is to catch them before they become CEOs, etc.