Archive for May, 2015

I Is For Identifier

Thursday, May 28th, 2015

As you saw yesterday, Sam Hunting and I have a presentation at Balisage 2015 (Wednesday, August 12, 2015, 9:00 AM, if you are buying a one-day ticket), “Spreadsheets – 90+ million end user programmers with no comment tracking or version control.”

If you suspect the presentation has something to do with topic maps, take one mark for your house!

You will have to attend the conference to get the full monty but there are some ideas and motifs that I will be testing here before incorporating them into the paper and possibly the presentation.

The first one is a short riff on identifiers.

Omitting the hyperlinks, the Wikipedia article on identifiers says in part:

An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique class of objects, where the “object” or class may be an idea, physical [countable] object (or class thereof), or physical [noncountable] substance (or class thereof). The abbreviation ID often refers to identity, identification (the process of identifying), or an identifier (that is, an instance of identification). An identifier may be a word, number, letter, symbol, or any combination of those.

(emphasis in original)

It goes on to say:

In computer science, identifiers (IDs) are lexical tokens that name entities. Identifiers are used extensively in virtually all information processing systems. Identifying entities makes it possible to refer to them, which is essential for any kind of symbolic processing.

There is an interesting shift in that last quote. Did you catch it?

The first two sentences are talking about identifiers but the third shifts to “[i]dentifying entities makes it possible to refer to them….” But single token identifiers aren’t the only means to identify an entity.

For example, a police record may identify someone by their Social Security Number and permit searching by that number, but it can also identify an individual by height, weight, eye/hair color, age, tattoos, etc.

But we have been taught from a very young age that I stands for Identifier, a single token that identifies an entity. Thus:


Single identifiers are found in “virtually all information systems,” not to mention writing from all ages and speech as well. They save us a great deal of time by allowing us to say “President Obama” without having to enumerate all the other qualities that collectively identify that subject.

Of course, the problem with single token identifiers is that we don’t all use the same ones and sometimes use the same ones for different things.

So long as we remain fixated on bare identifiers:


we will continue to see efforts to create new “persistent” identifiers. Not a bad idea for some purposes, but a rather limited one.

Instead of bare identifiers, what if we understood that identifiers stand in the place of all the qualities of the entities we wish to identify?

That is, what if our identifiers were seen as being pregnant with the qualities of the entities they represent:


For some purposes, like unique keys in a database, our identifiers can be seen as opaque identifiers, that’s all there is to see.

For other purposes, such as indexing across different identifiers, then our identifiers are pregnant with the qualities that identify the entities they represent.

If we look at the qualities of the entities represented by two or more identifiers, we may discover that the same identifier represents two different entities, or we may discover that two (or more) identifiers represent the same entities.
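That comparison can be sketched in a few lines of Python. Everything here is an assumption for illustration: the records, the quality names, and especially the matching rule (two records denote the same entity when the qualities they both record never conflict and at least two are shared). It is not a description of any particular topic map engine.

```python
from itertools import combinations

# Hypothetical records: each pairs an identifier with the qualities
# of the entity it is being used to identify.
records = [
    {"id": "Paris", "qualities": {"type": "city", "country": "France"}},
    {"id": "Paris", "qualities": {"type": "city", "country": "United States",
                                  "state": "Texas"}},
    {"id": "City of Light", "qualities": {"type": "city", "country": "France"}},
]


def same_entity(a, b, min_shared=2):
    """Assumed rule: same entity if the qualities both records state
    never conflict and at least `min_shared` of them are shared."""
    shared = a["qualities"].keys() & b["qualities"].keys()
    if len(shared) < min_shared:
        return False
    return all(a["qualities"][k] == b["qualities"][k] for k in shared)


def compare(records):
    """Return (homonyms, synonyms) as lists of record index pairs."""
    homonyms, synonyms = [], []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        match = same_entity(a, b)
        if a["id"] == b["id"] and not match:
            homonyms.append((i, j))  # same identifier, different entities
        elif a["id"] != b["id"] and match:
            synonyms.append((i, j))  # different identifiers, same entity
    return homonyms, synonyms


homonyms, synonyms = compare(records)
print(homonyms)  # pairs sharing an identifier but not an entity
print(synonyms)  # pairs sharing an entity but not an identifier
```

The bare identifier "Paris" cannot tell the first two records apart; only the qualities can. How many shared qualities are enough to declare a match is a requirements question, which is why `min_shared` is a parameter rather than a constant.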

I think we need to acknowledge the allure of bare identifiers (the ones we think we understand) and their usefulness in many circumstances. We should also observe that identifiers are in fact pregnant with the qualities of the entities they represent, enabling us to distinguish the case of the same identifier for different entities and to match different identifiers for the same entity.

Which type of identifier you need, bare or pregnant, depends upon your use case and requirements. Neither one is wholly suited for all purposes.

(Comments and suggestions are always welcome but especially on these snippets of material that will become part of a larger whole. On the artwork as well. I am trying to teach myself Gimp.)

Cybersecurity: Authoritative Reports and Resources, by Topic [Why We Need Librarians]

Wednesday, May 27th, 2015

Cybersecurity: Authoritative Reports and Resources, by Topic by Rita Tehan, Information Specialist (Congressional Research Service).

From the summary:

This report provides references to analytical reports on cybersecurity from CRS, other government agencies, trade associations, and interest groups. The reports and related websites are grouped under the following cybersecurity topics:

  • Policy overview
  • National Strategy for Trusted Identities in Cyberspace (NSTIC)
  • Cloud computing and the Federal Risk and Authorization Management Program (FedRAMP)
  • Critical infrastructure
  • Cybercrime, data breaches, and data security
  • National security, cyber espionage, and cyberwar (including Stuxnet)
  • International efforts
  • Education/training/workforce
  • Research and development (R&D)

In addition, the report lists selected cybersecurity-related websites for congressional and government agencies; news; international organizations; and other organizations, associations, and institutions.

Great report on cybersecurity resources!

As well as a killer demo for why we need librarians, now more than ever.

Here’s the demo. Print and show the cover page of the report to a library doubter.

Let them pick a category from the table of contents and then you count the number of federal government resources in the report. Give them a week to duplicate the contents of the section they have chosen. 😉

Anyone, including your colleagues, can find something relevant on the WWW. The question is whether they can find all the good stuff.

(Sorry, librarians and search experts are not eligible for this challenge.)

I first saw this in Gary Price’s The Research Desk.

Balisage 2015 Program Is Out!

Wednesday, May 27th, 2015

Balisage 2015 Program

Tommie Usdin posted this message announcing the Balisage 2015 program:

I think this is an especially strong Balisage program with a good mix of theoretical and practical. The 2015 program includes case studies from journal publishing, regulatory compliance systems, and large-scale document systems; formatting XML for print and browser-based print formatting; visualizing XML structures and documents. Technical papers cover such topics as: MathML; XSLT; use of XML in government and the humanities; XQuery; design of authoring systems; uses of markup that vary from poetry to spreadsheets to cyber justice; and hyperdocument link management.

Good as far as it goes but a synopsis (omitting blurbs and debauchery events) of the program works better for me:

  • The art of the elevator pitch B. Tommie Usdin, Mulberry Technologies
  • Markup as index interface: Thinking like a search engine Mary Holstege, MarkLogic
  • Markup and meter: Using XML tools to teach a computer to think about versification David J. Birnbaum, Elise Thorsen, University of Pittsburgh
  • XML (almost) all the way: Experiences with a small-scale journal publishing system Peter Flynn, University College Cork
  • The state of MathML in K-12 educational publishing Autumn Cuellar, Design Science Jean Kaplansky, Safari Books Online
  • Diagramming XML: Exploring concepts, constraints and affordances Liam R. E. Quin, W3C
  • Spreadsheets – 90+ million end user programmers with no comment tracking or version control Patrick Durusau Sam Hunting
  • State chart XML as a modeling technique in web engineering Anne, Marouane Sayih, Zlatina Cheva, Technische Universität München
  • Implementing a system at US Patent and Trademark Office to fully automate the conversion of filing documents to XML Terrel Morris, US Patent and Trademark Office Mark Gross, Data Conversion Laboratory Amit Khare, CGI Federal
  • XML solutions for Swedish farmers: A case study Ari Nordström, Creative Words
  • XSDGuide — Automated generation of web interfaces from XML schemas: A case study for suspicious activity reporting Fabrizio Gotti, Université de Montréal Kevin Heffner, Pegasus Research & Technologies Guy Lapalme, Université de Montréal
  • Tricolor automata C. M. Sperberg-McQueen, Black Mesa Technologies; Technische Universität Darmstadt
  • Two from three (in XSLT) John Lumley, jωL Research / Saxonica
  • XQuery as a data integration language Hans-Jürgen Rennau, Traveltainment Christian Grün, BaseX
  • Smart content for high-value communications David White, Quark Software
  • Vivliostyle: An open-source, web-browser based, CSS typesetting engine Shinyu Murakami, Johannes Wilm, Vivliostyle
  • Panel discussion: Quality assurance in XML transformation
  • Comparing and diffing XML schemas Priscilla Walmsley, Datypic
  • Applying intertextual semantics to cyberjustice: Many reality checks for the price of one Yves Marcoux, Université de Montréal
  • UnderDok: XML structured attributes, change tracking, and the metaphysics of documents Claus Huitfeldt, University of Bergen, Norway
  • Hyperdocument authoring link management using Git and XQuery in service of an abstract hyperdocument management model applied to DITA hyperdocuments Eliot Kimber, Contrext
  • Extending the cybersecurity digital thread with XForms Joshua Lubell, National Institute of Standards and Technology
  • Calling things by their true names: Descriptive markup and the search for a perfect language C. M. Sperberg-McQueen, Black Mesa Technologies; Technische Universität Darmstadt

Now are you ready to register and make your travel arrangements?

Disclaimer: I have no idea why the presentation: Spreadsheets – 90+ million end user programmers with no comment tracking or version control is highlighted in your browser. Have you checked your router for injection attacks by the NSA? 😉

PS: If you are doing a one-day registration, the Spreadsheets presentation is Wednesday, August 12, 2015, 9:00 AM. Just saying.

Ephemeral identifiers for life science data

Wednesday, May 27th, 2015

10 Simple rules for design, provision, and reuse of persistent identifiers for life science data by Julie A. McMurray, et al. (35 others).

From the introduction:

When we interact, we use names to identify things. Usually this works well, but there are many familiar pitfalls. For example, the “morning star” and “evening star” are both names for the planet Venus. “The luminiferous ether” is a name for an entity which no one still thinks exists. There are many women named “Margaret”, some of whom go by “Maggie” and some of whom have changed their surnames. We use everyday conversational mechanisms to work around these problems successfully. Naming problems have plagued the life sciences since Linnaeus pondered the Norway spruce; in the much larger conversation that underlies the life sciences, problems with identifiers (Box 1) impede the flow and integrity of information. This is especially challenging within “synthesis research” disciplines such as systems biology, translational medicine, and ecology. Implementation-driven initiatives such as ELIXIR, BD2K, and others (Text S1) have therefore been actively working to understand and address underlying problems with identifiers.

Good, global-scale, persistent identifier design is harder than it appears, and is essential for data to be Findable, Accessible, Interoperable, and Reusable (Data FAIRport principles [1]). Digital entities (e.g., files), physical entities (e.g., biosamples), and descriptive entities (e.g., ‘mitosis’) have different requirements for identifiers. Identifiers are further complicated by imprecise terminology and different forms (Box 1).

Of the identifier forms, Local Resource Identifiers (LRI) and their corresponding full Uniform Resource Identifiers (URIs) are still among the most commonly used and most problematic identifiers in the bio-data ecosystem. Other forms of identifiers such as Uniform Resource Name (URNs) are less impactful because of their current lack of uptake. Here, we build on emerging conventions and existing general recommendations [2,3] and summarise the identifier characteristics most important to optimising the flow and integrity of life-science data (Table 1). We propose actions to take in the identifier ‘green field’ and offer guidance for using real-world identifiers from diverse sources.

Truth be told, global, persistent identifier design is overreaching.

First, some identifiers are more widely used than others, but there are no globally accepted identifiers of any sort.

Second, “persistent” is undefined. Present identifiers (CURIEs or URIs) have not persisted pre-Web identifiers. On what basis would you claim that future generations will persist our identifiers?

However, systems expect to be able to make references by single, opaque identifiers and so the hunt goes on for a single identifier.

The more robust and in fact persistent approach is to have a bag of identifiers for any subject, where each identifier itself has a bag of properties associated with it.

That avoids the exclusion of old identifiers and hence historical records and avoids pre-exclusion of future identifiers, which come into use long after our identifier is no longer the most popular one.

Systems can continue to use a single identifier, locally as it were, but software where semantic integration is important should use sets of identifiers to facilitate integration across data sources.
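A rough Python sketch of the bag-of-identifiers idea follows. The class names, the example identifiers (all under example.org or made up), the property names, and the merge rule (subjects that share any identifier value are treated as the same subject) are assumptions chosen for illustration, not anything proposed in the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identifier:
    """One identifier plus its own bag of properties (all hypothetical)."""
    value: str
    properties: frozenset = field(default_factory=frozenset)


@dataclass
class Subject:
    """A subject is represented by a bag (here, a set) of identifiers."""
    identifiers: set = field(default_factory=set)

    def add(self, value, **properties):
        self.identifiers.add(Identifier(value, frozenset(properties.items())))

    def values(self):
        return {i.value for i in self.identifiers}


def merge_if_same(a, b):
    """Assumed integration rule: subjects sharing any identifier value
    are the same subject, so their bags are unioned."""
    if a.values() & b.values():
        return Subject(a.identifiers | b.identifiers)
    return None


# Two data sources describing (we assume) the same thing.
source_a = Subject()
source_a.add("example:gene42", scheme="curie")
source_a.add("OLD-0042", scheme="legacy", status="retired")

source_b = Subject()
source_b.add("http://example.org/gene/42", scheme="uri")
source_b.add("OLD-0042", scheme="legacy", status="retired")

merged = merge_if_same(source_a, source_b)
print(sorted(merged.values()))
```

Each source keeps using whichever single identifier it likes locally. The retired identifier stays in the bag, so historical records still resolve, and a future identifier is one more `add` call rather than a migration.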

Speaking Truth To Power (sort of)

Wednesday, May 27th, 2015

16 maps that Americans don’t like to talk about by Max Fisher.

Max lists the following maps:

  1. The US was built on the theft of Native Americans’ lands
  2. The Trail of Tears, one of the darkest moments in US history — and we rarely talk about it
  3. America’s indigenous population today is sparse and largely lives in areas we forced them into
  4. America didn’t just tolerate slavery for a century — we expanded it
  5. This 1939 map of redlining in Chicago is just a hint at the systematic discrimination against African Americans
  6. School segregation is still a terrible problem
  7. Kids born poor have almost no chance at achieving the American Dream
  8. America has the second-highest child poverty rate in the developed world
  9. The US ranks alongside Nigeria on income inequality
  10. The US tried to replace Spain as an imperialist power
  11. The US outright stole Hawaii as part of its Pacific colonialism
  12. The firebombing that devastated Japan — including lots of non-military targets
  13. Agent Orange: the chemical we used to destroy a generation in Vietnam and harm our own troops
  14. The US backed awful dictators and insurgencies of the Cold War
  15. The thousands of Iraqi civilian deaths in the Iraq War
  16. Syria’s refugee crisis; the humanitarian catastrophe we could still help address but won’t

As far as Max’s maps:

Truthful? Yes.

Informative? Yes.

Not widely known? In some cases.

Will result in different outcomes? Not so far.

The repetition of these narratives is part and parcel of Chomsky’s Propaganda System that we were discussing yesterday.

People make entire careers at keeping old injustices alive. Taking up historical causes is safe because the past is beyond our ability to change. You don’t want to be the March of Dimes when they discover a cure for polio.

Is bringing up old injustices speaking truth to power? After some amount of discussion, those in power will stop pretending to pay attention, a majority of citizens will lose interest (until next time) and present injustices will continue without effort or change.

Ask yourself, whose interest does distraction from current injustices serve?

Power can tolerate a lot of truth, so long as it is beyond being changed by anyone. The crowd can vent its righteous anger, speeches can be made, marches held, and other than cleaning up after crowds, the system grinds on.

PS: On Syrian refugees, Saudi Arabia is a lot closer than the United States and the oil states of the Middle East have the resources to more than adequately care for Syrian refugees. US involvement will only continue its tradition of weak/corrupt governments in the Middle East.

U.S. sides with Oracle in Java copyright dispute with Google

Wednesday, May 27th, 2015

U.S. sides with Oracle in Java copyright dispute with Google by John Ribeiro.

From the post:

The administration of President Barack Obama sided with Oracle in a dispute with Google on whether APIs, the specifications that let programs communicate with each other, are copyrightable.

Nothing about the API (application programming interface) code at issue in the case materially distinguishes it from other computer code, which is copyrightable, wrote Solicitor General Donald B. Verrilli in a filing in the U.S. Supreme Court.

The court had earlier asked for the government’s views in this controversial case, which has drawn the attention of scientists, digital rights group and the tech industry for its implications on current practices in developing software.

Although Google has raised important concerns about the effects that enforcing Oracle’s copyright could have on software development, those concerns are better addressed through a defense on grounds of fair use of copyrighted material, Verrilli wrote.

Neither the ScotusBlog case page, Google Inc. v. Oracle America, Inc., nor the Solicitor General’s Supreme Court Brief page, as of May 27, 2015, has a copy of the Solicitor General’s brief.

I hesitate to comment on the Solicitor General’s brief sight unseen, as media reports on legal issues are always vague and frequently wrong.

Setting to one side whatever Solicitor General Verrilli may or may not have said, software interoperability should be the default, not something established by affirmative defenses. Public policy should encourage interoperability of software.

Consumers, large and small, should be aware that reduction of interoperability between software means higher costs for consumers. Something to keep in mind when you are looking for a vendor.

So You Need To Hire A Hacker?

Wednesday, May 27th, 2015

In Hacker’s List leaks its secrets, revealing true identities of those wanting to hack, Graham Cluley points out a simple hack that matches 25% of the requests to Facebook accounts.

Well, it is a site for people who need to hire hackers so their lack of security skills isn’t surprising.

Cluley and others make much of the usual requests being unlawful. True but given the conduct of the government and the average business, I’m not sure why that is a trump card.

The better reason to avoid pedestrian offers to undertake illegal activity is:

On the Internet, no one knows you are the FBI (or other law enforcement agency).

It’s not much better in real life. Every publicized “murder for hire” in Georgia for the last year involved the hiring of undercover police officers. If someone you don’t already know offers to kill someone for money, it is nearly certain they are a police officer.

And the FBI is scurrying around in real life trying to find people who make good terrorist defendants. Beware of people offering to procure explosives, instructions on explosives, providing transportation, urging you to make statements about causes in the Middle East in general or ISIS in particular.

The best way to start reading current news from a jail cell is to start telling others what you want to do about items in the news. It’s lawful for some groups or opinions to glory in wholesale slaughter of innocents, not for others. If you are in the “not for others” group, don’t throw away your life by letting your mouth overload your brain. That won’t benefit anyone.

Bear in mind that revolutions aren’t won by narcissists who get on the evening news. No, revolutions are won by those who labor in obscurity knowing their cause is just and will prevail. Sometimes that leads to newsworthy events, but they don’t seek them out.

Besides, not seeking publicity drives governments crazy. They are certain that danger lurks under every bush and shrub anyway. Not being able to find real danger makes them all the more frantic.

Report: IRS hacked, tax info stolen for 100,000+

Tuesday, May 26th, 2015

Report: IRS hacked, tax info stolen for 100,000+ by Nathan Mattise.

From the post:

According to the Associated Press, the IRS has disclosed a hack where blackhats “used an online service provided by the agency” to access data for more than 100,000 taxpayers.

The IRS issued a statement today saying the compromised system was “Get Transcript.” The AP reports thieves were able to bypass the security screen requiring user information such as SSN, date of birth, and street address. The IRS has shut down the service currently, and it claims “Get Transcript” was targeted for more than two months between February and mid-May.

Watch Ars Technica for future updates.

BTW, the solution is not more laws, stiffer penalties for present laws, or buggy software on top of buggy software. Anyone who says differently is making money off you being insecure.

Such as software companies that release software with buffer overflow issues. To save you from following the link, buffer overflows were first identified as a security issue in 1972. Forty-three (43) years ago.

Is it the result of lack of talent or interest that buffer overflows remain a problem to this day?

Liability for buffer overflows would be a powerful incentive for the detection and prevention of same.

No liability for buffer overflows will leave software where it has been for the last forty-three (43) years.

Your security, your choice.

29 Cyber Security Blogs You Should Be Reading

Tuesday, May 26th, 2015

29 Cyber Security Blogs You Should Be Reading by Sarah Vonnegut.

From the post:

Staying up-to-date is important for lots of reasons, but when you’re an Information Security professional, knowing about the latest tech, breaches, vulnerabilities, etc. is pretty much essential to your career. If you miss out on an important piece of news, your organization could miss out on much more.

More than just knowing what’s going on, though, keeping current in cyber security news is an opportunity to absorb and uncover innovative ideas surrounding InfoSec and the way you do your job.

The InfoSec community is lucky, to be honest. With so many security blogs available on the interwebs, the only real question we have to ask ourselves is: Which ones should I be reading on a regular basis? Well, for that we’re here to help. We’ve compiled a list of the blogs and security news sites we read and consistently gain value from on every visit.

The following cyber security blogs are those we consider thought-leaders in each of their niches and offer a full range of topics within cyber security. We hope you’ll discover some new reading material, and if we missed one of your favorite cyber security blogs, tweet us and let us know @Checkmarx! (emphasis in original)

Most of these you will probably know but I suspect there will be at least one or two new ones in the list. A good starting point for beefing up your security RSS feed or creating something a bit more sophisticated that groups posts on the same story.
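As a starting point for the "bit more sophisticated" option, here is a minimal Python sketch that groups post titles by word overlap (Jaccard similarity). The stop-word list, the 0.3 threshold, and the greedy compare-to-first-title strategy are all assumptions, and the second headline below is invented for the example. Fetching and parsing the feeds themselves is left out.

```python
import re

# A deliberately tiny, assumed stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "for", "in", "on", "to", "and"}


def tokens(title):
    """Lowercased word set of a title, minus the stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_story(titles, threshold=0.3):
    """Greedy grouping: a title joins the first group whose first
    title is similar enough, otherwise it starts a new group."""
    groups = []
    for title in titles:
        t = tokens(title)
        for group in groups:
            if jaccard(t, tokens(group[0])) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups


titles = [
    "IRS hacked, tax info stolen for 100,000+",
    "Tax info for 100,000 stolen in IRS hack",  # hypothetical second source
    "Millions of AdultFriendFinder members exposed after hack",
]
for group in group_by_story(titles):
    print(group)
```

Note that "hack" and "hacked" do not match here; stemming, or comparing post bodies rather than titles, would be the obvious next refinements.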

JavaScript Graph Comparison

Tuesday, May 26th, 2015

JavaScript Graph Comparison

Bookmark this site if you need to select one of the many JavaScript graph libraries by features. Select by graph type, pricing scheme, options and dependencies.

The thought does occur to me that a daily chart of deaths by cause and expenditures by the U.S. government on that cause could make an interesting graphic. Particularly on the many days that there are no deaths from terrorism but the money keeps on pouring out.

Propaganda System: ‘what we’re doing against terrorism/ISIS is obviously right and just’.

Tuesday, May 26th, 2015

A recent interview with Noam Chomsky has this explanation of “propaganda system”:

For the propaganda model, notice what we explain there very explicitly is that this is a first approximation – and a good first approximation – for the way the media functions. We also mention that there are many other factors. In fact, if you take a look at the book ‘Manufacturing Consent’, about practically a third of the book, which nobody seems to have read, is a defence of the media from criticism by what are called civil rights organisations – Freedom House in this case. It’s a defence of the professionalism and accuracy of the media in their reporting, from a harsh critique which claimed that they were virtually traitors undermining government policy. We should have known, on the other hand, that they were quite professional.

The media didn’t like that defence because what we said is – and this was about the Tet Offensive – that the reporters were very honest, courageous, accurate, and professional, but their work was done within a framework of tacit acquiescence to a propaganda system that was simply unconscious. The propaganda system was ‘what we’re doing in Vietnam is obviously right and just’. And that passively supports the doctrinal system….

As an example of the current propaganda system that holds the media in thrall, the next question wasn’t:

Would you say the current propaganda system can be captured by: ‘what we’re doing against terrorism/ISIS is obviously right and just’.

Instead, the interviewer (to demonstrate his failure to understand the propaganda model?) follows with a question about Snowden/Greenwald as counter-examples to the propaganda model. Chomsky says no and points out that the propaganda model doesn’t explain all things. Of course, an interviewer who had read and understood “Manufacturing Consent” would have known that.

If anything, the reporting on terrorism and ISIS is even more lopsided than reporting in the Vietnam era. The media spasms at every email or phone threat and creates “news” that will race around the globe, such as bomb threats against airliners.

Chomsky himself falls prey to the mainstream propaganda model in his identification of a leading sponsor of terrorism: the leading state sponsor of terrorism, with the sub-title:

U.S. covert operations routinely resemble acts of terrorism.

Freed from the propaganda model it would read:

U.S. covert operations are acts of terrorism.

Death randomly falling out of the sky and killing innocent civilians is by its very nature a terrorist act. (full stop) Whatever you may think is a justification for it, it remains an act of terrorism, committed by terrorists.

Test the propaganda capture of the media for yourself. Count the number of times reporters, in stories or interviews, point out the non-danger to the average American from terrorism. Do they ask for facts when that is denied? Do they confront government officials with contrary data? See: The Terrorism Statistics Every American Needs to Hear. As Chomsky would say, there are members of the media who are doing their professional jobs. Unfortunately there are too few of them and most fail to confront the contemporary propaganda system effectively.

(Chomsky interview: Noam Chomsky: Why the Internet Hasn’t Freed Our Minds—Propaganda Continues to Dominate by Seung-yoon Lee, Noam Chomsky.)

Rule #1 & AdultFriendfinder’s Database

Tuesday, May 26th, 2015

Graham Cluley, among many others, has reported on the data breach at AdultFriendFinder which reveals email addresses, usernames, dates of birth, postal codes, IP addresses and sexual preferences of users. (Millions of AdultFriendFinder members exposed after hack.)

There were fifteen (15) spreadsheets on the Dark Web from the breach but sans payment information. Rumor (source: Web search) has it that the hacker wants $17,000 for the entire database with payment information.

I am updating Email Rule #1 to be:

Rule #1:

Never write/record/enter/say in front of witnesses, anything that you would not want repeated to a federal grand jury or published on the front page of the New York Times.

How difficult is that?

If you had an AdultFriendFinder account, or can’t remember if you do or don’t, check out: ‘;–have i been pwned?, which as of today, has 183,019,526 pwned accounts, including those from AdultFriendFinder.

Before you venture out again, unsupervised, have someone purchase several cash cards, preferably in another city, buy a burner cellphone, obtain several bogus email addresses that are not easily traced to you, and use Astoria (or the most recent Tor client), and that is at a minimum.

30 Malicious IP list and Block lists providers 2015

Monday, May 25th, 2015

30 Malicious IP list and Block lists providers 2015

From the post:

Stop and use this great Malicious IP lists to block unwanted traffic to your network and company. These lists are daily updated by the best security companies which keep track of malicious domains, IP addresses and more.

The malicious IP list providers which have been listed provide FREE information about malicious IP’s and they also provide block lists which you can use in your firewall or security configuration to block unwanted traffic. It is important to keep in mind that IP addresses are often dynamic and that they can be used in the future for legitimate traffic, so I do urge you to be aware when you use a block list.

Rather oddly, the post gives an alphabetical listing of the malicious IP and block list providers, 2015, but then lists the URLs for those providers in the same order, in a separate file.
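Once you have fetched one of those lists, checking traffic against it takes only the Python standard library. This is a minimal sketch, assuming a list format of one address or CIDR range per line with `#` comments (real providers vary), and using reserved documentation ranges (RFC 5737) rather than real malicious addresses.

```python
import ipaddress

# Hypothetical block list contents in the assumed one-entry-per-line format.
BLOCK_LIST_TEXT = """
# comment lines and blanks are ignored
192.0.2.15
198.51.100.0/24
"""


def parse_block_list(text):
    """Parse one address or CIDR range per line into network objects."""
    networks = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            # strict=False accepts a bare address as a single-host network
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(address, networks):
    """True if the address falls inside any listed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)


networks = parse_block_list(BLOCK_LIST_TEXT)
print(is_blocked("192.0.2.15", networks))     # listed single address
print(is_blocked("198.51.100.77", networks))  # inside the listed /24
print(is_blocked("203.0.113.9", networks))    # not listed
```

Given the post's warning that addresses are often dynamic, the parsed list should be rebuilt whenever the provider updates rather than cached indefinitely.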

Certainly these lists will be useful for security purposes but it occurs to me they would also be useful for the collection and study of malware. Rather than casting about for malware, you could use these lists as treasure maps, perhaps even tracking the spread of new malware across these sites.

Is there a public collection of viruses, etc. as found in the wild? Think of the marketing opportunities if software virus collecting and analysis became a major hobby.

The Beauty of LATEX

Monday, May 25th, 2015

The Beauty of LATEX

From the post:

There are several reasons why one should prefer LaTeX to a WYSIWYG word processor like Microsoft Word: portability, lightness, security are just a few of them (not to mention that LaTeX is free). There is still a further reason that definitely convinced me to abandon MS Word when I wrote my dissertation: you will never be able to produce professionally typeset and well-structured documents using most WYSIWYG word processors. LaTeX is a free typesetting system that allows you to focus on content without bothering about the layout: the software takes care of the actual typesetting, structuring and page formatting, producing documents of astonishing elegance. The software I use to write in LaTeX on a Mac compiles documents in PDF format (but exporting to other formats such as RTF or HTML is also possible). It supports unicode and all the advanced typographic features of OpenType and AAT fonts, like Adobe Garamond Pro and Hoefler Text. It allows fine-tuned control on a number of typesetting options, although just using the default configuration results in documents with high typographic quality. In what follows I review some examples, comparing how fonts are rendered in MS Word and in LaTeX.

I thought about mentioning LATEX in my post about MS Office, but that would hardly be fair. LATEX is professional grade publishing software. Any professional grade software package takes user investment to become proficient at its use.

MS Office, on the other hand, is within the reach of anyone who can start a computer. The results may be ugly but the learning curve looks like a long tail. A very long tail.

Malicious Microsoft Office versions are in the wild

Monday, May 25th, 2015

Malicious Microsoft Office versions are in the wild

At first I wondered why this was news? 😉

After reading the post I realized they meant hacked versions of Microsoft Office, which in addition to the standard bugs and vulnerabilities, come with additional vulnerabilities installed by the people who hacked the official version.

I am untroubled by the presence of additional vulnerabilities in hacked versions of Microsoft Office. As the saying goes, “…you get what you pay for.”

If you want Microsoft Office, then buy a copy of Microsoft Office. You won’t get much sympathy for security problems created while trying to cheat others. At least not from me.

If you want or need alternatives to Microsoft Office, try Apache OpenOffice or LibreOffice.

Even with “free” software, you should always use official or reputable distribution sites. A little bit of caution on your part will present attackers with a much smaller attack surface. Staff that don’t exercise such caution should be recommended to your competitors.

LOFAR Transients Pipeline (“TraP”)

Sunday, May 24th, 2015

LOFAR Transients Pipeline (“TraP”)

From the webpage:

The LOFAR Transients Pipeline (“TraP”) provides a means of searching a stream of N-dimensional (two spatial, frequency, polarization) image “cubes” for transient astronomical sources. The pipeline is developed specifically to address data produced by the LOFAR Transients Key Science Project, but may also be applicable to other instruments or use cases.

The TraP codebase provides the pipeline definition itself, as well as a number of supporting routines for source finding, measurement, characterization, and so on. Some of these routines are also available as stand-alone tools.

High-level overview

The TraP consists of a tightly-coupled combination of a “pipeline definition” – effectively a Python script that marshals the flow of data through the system – with a library of analysis routines written in Python and a database, which not only contains results but also performs a key role in data processing.

Broadly speaking, as images are ingested by the TraP, a Python-based source-finding routine scans them, identifying and measuring all point-like sources. Those sources are ingested by the database, which associates them with previous measurements (both from earlier images processed by the TraP and from other catalogues) to form a lightcurve. Measurements are then performed at the locations of sources which were expected to be seen in this image but which were not detected. A series of statistical analyses are performed on the lightcurves constructed in this way, enabling the quick and easy identification of potential transients. This process results in two key data products: an archival database containing the lightcurves of all point-sources included in the dataset being processed, and community alerts of all transients which have been identified.

Exploiting the results of the TraP involves understanding and analysing the resulting lightcurve database. The TraP itself provides no tools directly aimed at this. Instead, the Transients Key Science Project has developed the Banana web interface to the database, which is maintained separately from the TraP. The database may also be interrogated by end-user developed tools using SQL.

While it uses the term “association,” I think you will conclude it is much closer to merging in a topic map sense:

The association procedure knits together (“associates”) the measurements in extractedsource which are believed to originate from a single astronomical source. Each such source is given an entry in the runningcatalog table which ties together all of the measurements by means of the assocxtrsource table. Thus, an entry in runningcatalog can be thought of as a reference to the lightcurve of a particular source.
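To make the merging flavor of that description concrete, here is a minimal sketch using an in-memory SQLite database. The table names (extractedsource, runningcatalog, assocxtrsource) come from the quoted documentation; every column name, the simple box match, and the match radius are my assumptions for illustration, not the TraP's actual schema or association statistics.

```python
import sqlite3

# Table names follow the TraP documentation quoted above.
# All column names are hypothetical, chosen only for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE extractedsource (id INTEGER PRIMARY KEY, image INTEGER,
                                  ra REAL, decl REAL, flux REAL);
    CREATE TABLE runningcatalog (id INTEGER PRIMARY KEY, ra REAL, decl REAL);
    CREATE TABLE assocxtrsource (runcat INTEGER, xtrsrc INTEGER);
""")

MATCH_RADIUS = 0.01  # degrees; an assumed tolerance, not a TraP default


def associate(conn, xtrsrc_id, ra, decl):
    """Tie a measurement to an existing catalog entry, or start a new one."""
    row = conn.execute(
        "SELECT id FROM runningcatalog "
        "WHERE ABS(ra - ?) < ? AND ABS(decl - ?) < ? LIMIT 1",
        (ra, MATCH_RADIUS, decl, MATCH_RADIUS),
    ).fetchone()
    if row is None:
        runcat_id = conn.execute(
            "INSERT INTO runningcatalog (ra, decl) VALUES (?, ?)", (ra, decl)
        ).lastrowid
    else:
        runcat_id = row[0]
    conn.execute(
        "INSERT INTO assocxtrsource (runcat, xtrsrc) VALUES (?, ?)",
        (runcat_id, xtrsrc_id),
    )
    return runcat_id


def lightcurve(conn, runcat_id):
    """Return (image, flux) pairs for one catalog entry, in image order."""
    return conn.execute(
        "SELECT x.image, x.flux FROM assocxtrsource a "
        "JOIN extractedsource x ON x.id = a.xtrsrc "
        "WHERE a.runcat = ? ORDER BY x.image",
        (runcat_id,),
    ).fetchall()


# Made-up measurements: (image, ra, decl, flux)
measurements = [
    (1, 10.000, 20.000, 5.1),
    (1, 45.000, -3.000, 2.0),
    (2, 10.001, 20.001, 5.4),
    (3, 10.000, 19.999, 9.8),
]
for image, ra, decl, flux in measurements:
    cur = conn.execute(
        "INSERT INTO extractedsource (image, ra, decl, flux) VALUES (?, ?, ?, ?)",
        (image, ra, decl, flux),
    )
    associate(conn, cur.lastrowid, ra, decl)

print(lightcurve(conn, 1))
```

Three of the four measurements land on the same runningcatalog entry, which is the topic map move: several records about one subject, joined because their identifying properties (here, position) agree.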

Perhaps not of immediate use but good reading and a diversion from corruption, favoritism, oppression and other usual functions of government.

Beyond TPP: An International Agreement to Screw Security Researchers and Average Citizens

Sunday, May 24th, 2015

US Govt proposes to classify cybersecurity or hacking tools as weapons of war by Manish Singh.

From the post:

Until now only when someone possessed a chemical, biological or nuclear weapon, it was considered to be a weapon of mass destruction in the eyes of the law. But we could have an interesting — and equally controversial — addition to this list soon. The Bureau of Industry and Security (BIS), an agency of the United States Department of Commerce that deals with issues involving national security and high technology has proposed tighter export rules for computer security tools — first brought up in the Wassenaar Arrangement (WA) at the Plenary meeting in December 2013. This proposal could potentially revise an international agreement aimed at controlling weapons technology as well as hinder the work of security researchers.

At the meeting, a group of 41 like-minded states discussed ways to bring cybersecurity tools under the umbrella of law, just as any other global arms trade. This includes guidelines on export rules for licensing technology and software as it crosses an international border. Currently, these tools are controlled based on their cryptographic functionality. While BIS is yet to clarify things, the new proposed rule could disallow encryption license exceptions.

This is like attempting to control burglary by prohibiting the possession of hammers. Hammers are quite useful for a number of legitimate tasks. But you can’t have one because some homeowners put weak locks on their doors.

If you want to see all the detail: Wassenaar Arrangement 2013 Plenary Agreements Implementation: Intrusion and Surveillance Items.

Deadline for comments is: 07/20/2015.

Warning: It is not written for casual reading. I may try to work through it in case anyone wants to point to actual parts of the proposal that are defective.

The real danger of such a proposal isn’t that the Feds will run amok prosecuting people but it will give them a basis for leaning on innocent researchers and intimating that a friendly U.S. Attorney and district judge might just buy an argument that you have violated an export restriction.

Most people don’t have the resources to face such threats (think Aaron Swartz) and so the Feds win by default.

If you don’t want to be an unnamed victim of federal intimidation or a known victim like Aaron Swartz, the time to stop this atrocity is now.

10 Expert Search Tips for Finding Who, Where, and When

Sunday, May 24th, 2015

10 Expert Search Tips for Finding Who, Where, and When by Henk van Ess.

From the post:

Online research is often a challenge. Information from the web can be fake, biased, incomplete, or all of the above.

Offline, too, there is no happy hunting ground with unbiased people or completely honest authorities. In the end, it all boils down to asking the right questions, digital or not. Here are some strategic tips and tools for digitizing three of the most asked questions: who, where and when? They all have in common that you must “think like the document” you search.

Whether you are writing traditional news, blogging or writing a topic map, it’s difficult to think of a subject where “who, where, and when,” aren’t going to come up.

Great tips! Start of a great one pager to keep close at hand.

VA fails cybersecurity audit for 16th straight year

Sunday, May 24th, 2015

VA fails cybersecurity audit for 16th straight year by Katie Dvorak.

From the post:

The U.S. Department of Veterans Affairs, which failed its Federal Information Security Management Act Audit for Fiscal Year 2014, is taking major steps to fix its cybersecurity in the wake of increasing scrutiny over vulnerabilities and cyberdeficiencies at the agency, according to an article at Federal News Radio.

This marks the 16th consecutive year the VA has failed the cybersecurity audit, according to the article. While the audit found that the agency has made progress in creating security policies and procedures, it also determined that problems remain in implementing its security risk management program.

“Weaknesses in access and configuration management controls resulted from VA not fully implementing security standards on all servers, databases, and network devices,” the report reads. “VA also has not effectively implemented procedures to identify and remediate system security vulnerabilities on network devices, database, and server platforms VA-wide.”

The first cybersecurity lesson here is that if you are exchanging data with or interacting with the Veterans Administration, do so on a computer that is completely isolated from your network. In the event of data transfers, be sure to scan and clean all incoming data from the VA.

The second lesson is that requiring cybersecurity, in the absence of incentives for its performance or penalties for its lack, will always end badly.

Astoria, the Tor client designed to beat the NSA surveillance

Sunday, May 24th, 2015

Astoria, the Tor client designed to beat the NSA surveillance by Pierluigi Paganini.

From the post:

A team of security researchers announced to have developed Astoria, a new Tor client designed to beat the NSA and reduce the efficiency of timing attacks.

Tor and Deep web are becoming terms even popular among Internet users, the use of anonymizing network is constantly increasing for this reason intelligence agencies are focusing their efforts in its monitoring.

Edward Snowden has revealed that intelligence agencies belonging to the Five Eyes Alliance have tried to exploit several techniques to de-anonymized Tor users.

Today I desire to introduce you the result of the work of a joint effort of security researchers from American and Israeli organizations which have developed a new advanced Tor client called Astoria.

The Astoria Tor Client was specially designed to protect Tor user form surveillance activities, it implements a series of features that make eavesdropping harder.

Time to upgrade and to help support the Tor network!

Every day that you help degrade NSA activities, you have contributed to your own safety and the safety of others.

Summarizing and understanding large graphs

Sunday, May 24th, 2015

Summarizing and understanding large graphs by Danai Koutra, U Kang, Jilles Vreeken and Christos Faloutsos. (DOI: 10.1002/sam.11267)


How can we succinctly describe a million-node graph with a few simple sentences? Given a large graph, how can we find its most “important” structures, so that we can summarize it and easily visualize it? How can we measure the “importance” of a set of discovered subgraphs in a large graph? Starting with the observation that real graphs often consist of stars, bipartite cores, cliques, and chains, our main idea is to find the most succinct description of a graph in these “vocabulary” terms. To this end, we first mine candidate subgraphs using one or more graph partitioning algorithms. Next, we identify the optimal summarization using the minimum description length (MDL) principle, picking only those subgraphs from the candidates that together yield the best lossless compression of the graph—or, equivalently, that most succinctly describe its adjacency matrix.

Our contributions are threefold: (i) formulation: we provide a principled encoding scheme to identify the vocabulary type of a given subgraph for six structure types prevalent in real-world graphs, (ii) algorithm: we develop VoG, an efficient method to approximate the MDL-optimal summary of a given graph in terms of local graph structures, and (iii) applicability: we report an extensive empirical evaluation on multimillion-edge real graphs, including Flickr and the Notre Dame web graph.
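A toy illustration of the MDL intuition behind the abstract, not the paper's encoding scheme: if naming a node costs log2(n) bits, then saying "this is a star with hub h and these spokes" or "these k nodes form a clique" is cheaper than listing every edge. The cost functions below are my own simplification and ignore details a real encoding would need, such as the structure label and the node counts.

```python
import math


def node_bits(n_nodes):
    """Bits needed to name one node out of n_nodes (toy assumption)."""
    return math.log2(n_nodes)


def edge_list_cost(n_edges, n_nodes):
    """Describe edges one by one: two node names per edge."""
    return 2 * n_edges * node_bits(n_nodes)


def star_cost(n_spokes, n_nodes):
    """Describe a star as one hub plus its spokes."""
    return (1 + n_spokes) * node_bits(n_nodes)


def clique_cost(k, n_nodes):
    """Describe a clique by naming its k members; the edges are implied."""
    return k * node_bits(n_nodes)


n = 1024  # graph size for the toy comparison

# A star with 100 spokes has 100 edges.
print("star   as edges:", edge_list_cost(100, n), "as structure:", star_cost(100, n))

# A clique on 10 nodes has 10 * 9 / 2 = 45 edges.
print("clique as edges:", edge_list_cost(45, n), "as structure:", clique_cost(10, n))
```

The gap grows with the size of the structure, which is why a vocabulary of stars, cliques and the like can compress a graph, and why the best compression doubles as a summary.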

Use the DOI if you need the official citation, otherwise select the title link and it takes you to a non-firewalled copy.

A must read if you are trying to extract useful information out of multimillion-edge graphs.

I first saw this in a tweet by Kirk Borne.


Sunday, May 24th, 2015

New gTLDs: .SUCKS Illustrates Potential Problems for Security, Brand Professionals by Camille Stewart.

From the post:

The launch of the .SUCKS top-level domain name (gTLD) has reignited and heightened concerns about protecting brands and trademarks from cybersquatters and malicious actors. This new extension, along with more than a thousand others, has been approved by the Internet Corporation for Assigned Names and Numbers (ICANN) as part of their new gTLD program. The program was designed to spark competition and innovation by opening up the market to additional gTLDs.

Not surprisingly, though, complaints are emerging that unscrupulous operators are using .SUCKS to extort money from companies by threatening to use it to create websites that could damage their brands. ICANN is now reportedly asking the Federal Trade Commission (FTC) and Canada’s Office of Consumer Affairs to weigh in on potential abuses so it can address them. Recently, Congress weighed in on the issue, holding a hearing about .SUCKS and other controversial domains like .PORN.

Vox Populi Registry Ltd. began accepting registrations for .SUCKS domains on March 30 from trademark holders and celebrities before it opened to public applicants. It recommended charging $2,499 a year for each domain name registration, and according to Vox Populi CEO John Berard, resellers are selling most of the names for around $2,000 a year. Berard asserts that the extension is meant to create destinations for companies to interact with their critics, and called his company’s business “well within the lines of ICANN rules and the law.”

If you follow the link to the statement by Vox Populi CEO John Berard, that post concludes with:

The new gTLD program is about increasing choice and competition in the TLD space, it’s not supposed to be about applicants bilking trademark owners for whatever they think they can get away with.

A rather surprising objection considering that trademark (and copyright) owners have been bilking/gouging consumers for centuries.

Amazing how sharp the pain can be when a shoe pinches on a merchant’s foot.

How many Disney properties could end in .sucks? (Research question)

Kafka and the Foreign Intelligence Surveillance Court (FISA)

Sunday, May 24th, 2015

Quiz: Just how Kafkaesque is the court that oversees NSA spying? by Alvaro Bedoya and Ben Sobel.

From the post:

When Edward Snowden first went public, he did it by leaking a 4-page order from a secret court called the Foreign Intelligence Surveillance Court, or FISA court. Founded in 1978 after the Watergate scandal and investigations by the Church Committee, the FISA court was supposed to be a bulwark against secret government surveillance. In 2006, it authorized the NSA call records program – the single largest domestic surveillance program in American history.

“The court” in Franz Kafka’s novel The Trial is a shadowy tribunal that tries (and executes) Josef K., the story’s protagonist, without informing him of the crime he’s charged with, the witnesses against him, or how he can defend himself. (Worth noting: The FISA court doesn’t “try” anyone. Also, it doesn’t kill people.)

Congress is debating a bill that would make the FISA court more transparent. In the meantime, can you tell the difference between the FISA court and Kafka’s court?

After you finish the quiz, if you haven’t read The Trial by Franz Kafka, you should.

I got 7/11. What’s your score?

The FISA court is an illusion of due process that has been foisted off on American citizens.

To be fair, the number of rejected search or arrest warrants in regular courts is as tiny as the number of rejected applications in FISA court. (One study reports 1 rejected search warrant out of 1,748. Craig D. Uchida, Timothy S. Bynum, Search Warrants, Motions to Suppress and Lost Cases: The Effects of the Exclusionary Rule in Seven Jurisdictions, 81 J. Crim. L. & Criminology 1034 (1990-1991), at page: 1058)

However, any warrant issued by a regular court, including the affidavit setting forth “probable cause” becomes public. Both the police and judicial officers know the basis for warrants will be seen by others, which encourages following the rules for probable cause.

Contrast that with the secret warrants and the basis for secret warrants from the FISA court. There is no opportunity for the public to become informed about the activities of the FISA courts or the results of the warrants that it issues. The non-public nature of the FISA court deprives voters of the ability to effectively voice concerns about the FISA court.

The only effective way to dispel the illusion that secrecy is required for the FISA court is for there to be massive and repetitive leaks of FISA applications and opinions. Just like with the Pentagon Papers, the sky will not fall and the public will learn the FISA court was hiding widespread invasions of privacy based on the thinnest tissues of fantasy from intelligence officers.

If you think I am wrong about the FISA court, name a single government leak that did not reveal the government breaking the law, attempting to conceal incompetence or avoid accountability. Suggestions?

Global Investigative Journalism Conference (Lillehammer, October 8th-11th 2015)

Sunday, May 24th, 2015

Global Investigative Journalism Conference (Lillehammer, October 8th-11th 2015)

From the news page:

This year’s global event for muckrakers is approaching! Today we’re pleased to reveal the first glimpse of the program for the 9th Global Investigative Journalism Conference — #GIJC15 — in Lillehammer, Norway.

First in line are the data tracks. We have 56 sessions dedicated to data-driven journalism already confirmed, and there is more to come.

Three of the four data tracks will be hands-on, while a fourth will be showcases. In addition to that, the local organizing committee has planned a Data Pub.

The heavy security and scraping stuff will be in a special room, with three days devoted to security issues and webscraping with Python. The attendees will be introduced to how to encrypt emails, their own laptop and USB-sticks. They will also be trained to install security apps for text and voice. For those who think Python is too difficult, is an option.

For the showcases, we hope the audience will appreciate demonstrations from some of the authors behind the Verification Handbook, on advanced internet search techniques and using social media in research. There will be sessions on how to track financial crime, and the journalists behind LuxLeaks and SwissLeaks will conduct different sessions.

BTW, you can become a sponsor for the conference:

Interested in helping sponsor the GIJC? Here’s a chance to reach and support the “special forces” of journalism around the world – the reporters, editors, producers and programmers on the front lines of battling crime, corruption, abuse of trust, and lack of accountability. You’ll join major media organizations, leading technology companies, and influential foundations. Contact us at

Opposing “crime, corruption, abuse of trust, and lack of accountability?” There are easier ways to make a living but few are as satisfying.

PS: Looks like a good venue for discussing how topic maps could integrate resources from different sources or researchers.

Mastering Emacs (new book)

Saturday, May 23rd, 2015

Mastering Emacs by Mickey Petersen.

I can’t recommend Mastering Emacs as lite beach reading but next to a computer, it has few if any equals.

I haven’t ordered a copy (yet) but based on the high quality of Mickey’s Emacs posts, I recommend it sight unseen.

You can look inside at the TOC.

If you still need convincing, browse Mickey’s Full list of Tips, Tutorials and Articles for a generous sampling of his writing.


Deep Learning (MIT Press Book) – Update

Friday, May 22nd, 2015

Deep Learning (MIT Press Book) by Yoshua Bengio, Ian Goodfellow and Aaron Courville.

I last mentioned this book last August and wanted to point out that a new draft appeared on 19/05/2015.

Typos and opportunities for improvement still exist! Now is your chance to help the authors make this a great book!


The Unreasonable Effectiveness of Recurrent Neural Networks

Friday, May 22nd, 2015

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy.

From the post:

There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it one character at a time. You can also use it to reproduce my experiments below. But we’re getting ahead of ourselves; What are RNNs anyway?

I try to blog or reblog about worthy posts by others but every now and again, I encounter a post that is stunning in its depth and usefulness.

This post by Andrej Karpathy is one of the stunning ones.

In addition to covering RNNs in general, he takes the reader on a tour of “Fun with RNNs.”

Which covers the application of RNNs to:

  • A Paul Graham generator
  • Shakespeare
  • Wikipedia
  • Algebraic Geometry (Latex)
  • Linux Source Code

Along with source code, Andrej provides a list of further reading.

What’s your example of using RNNs?
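If "generate text one character at a time" sounds abstract, here is a minimal sketch of just the sampling loop. It is a single-layer vanilla RNN with random, untrained weights in pure Python, so the output is gibberish; Andrej's released code uses trained multi-layer LSTMs. The function names, hidden size and weight range are my assumptions for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weights; a trained model would have learned these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_text(seed_char, length, vocab, hidden_size=16, rng_seed=0):
    """Generate `length` characters, feeding each output back in as input."""
    rng = random.Random(rng_seed)
    v = len(vocab)
    w_xh = make_matrix(hidden_size, v, rng)            # input -> hidden
    w_hh = make_matrix(hidden_size, hidden_size, rng)  # hidden -> hidden
    w_hy = make_matrix(v, hidden_size, rng)            # hidden -> output
    h = [0.0] * hidden_size
    idx = vocab.index(seed_char)
    out = [seed_char]
    for _ in range(length - 1):
        x = [0.0] * v
        x[idx] = 1.0  # one-hot encoding of the current character
        h = [math.tanh(a + b) for a, b in zip(matvec(w_xh, x), matvec(w_hh, h))]
        probs = softmax(matvec(w_hy, h))
        idx = rng.choices(range(v), weights=probs)[0]  # sample the next character
        out.append(vocab[idx])
    return "".join(out)


vocab = list("abcdefghijklmnopqrstuvwxyz ")
print(sample_text("t", 40, vocab))
```

The hidden state h is the only memory the loop carries forward. Training is what turns that memory into something that knows how to close a parenthesis or finish a word.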

Harvesting Listicles

Friday, May 22nd, 2015

Scrape website data with the new R package rvest by

From the post:

Copying tables or lists from a website is not only a painful and dull activity but it’s error prone and not easily reproducible. Thankfully there are packages in Python and R to automate the process. In a previous post we described using Python’s Beautiful Soup to extract information from web pages. In this post we take advantage of a new R package called rvest to extract addresses from an online list. We then use ggmap to geocode those addresses and create a Leaflet map with the leaflet package. In the interest of coding local, we opted to use, as the example, data on wineries and breweries here in the Finger Lakes region of New York.

Lists and listicles are a common form of web content. Unfortunately, both are difficult to improve without harvesting the content and recasting it.

This post will put you on the right track to harvesting with rvest!

BTW, as a benefit to others, post data that you clean/harvest in a clean format. Yes?
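The post uses R and rvest. For anyone who wants the shape of the idea without installing anything, here is a rough sketch with Python's standard library only: pull the text of each list item out of an HTML fragment and write it back out as CSV. The HTML and the addresses are made up, and the class and function names are mine, not part of rvest or Beautiful Soup.

```python
import csv
import io
from html.parser import HTMLParser


class ListItemHarvester(HTMLParser):
    """Collect the text of each <li> element (nested lists are not handled)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self._in_item = False
            # Collapse runs of whitespace left over from the page layout.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append(text)

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)


def to_csv(items):
    """Return the harvested items as one-column CSV text with a header."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["entry"])
    for item in items:
        writer.writerow([item])
    return out.getvalue()


# Made-up listicle fragment for illustration.
html = """
<ul>
  <li><b>Example Winery</b>, 1 Lake Road, Sampletown, NY</li>
  <li>Example Brewery,
      2 Hill Street, Sampletown, NY</li>
</ul>
"""

harvester = ListItemHarvester()
harvester.feed(html)
print(to_csv(harvester.items))
```

Real pages are messier than this fragment, which is exactly why posting the cleaned result saves the next person the same work.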

First experiments with Apache Spark at Snowplow

Friday, May 22nd, 2015

First experiments with Apache Spark at Snowplow by Justin Courty.

From the post:

As we talked about in our May post on the Spark Example Project release, at Snowplow we are very interested in Apache Spark for three things:

  1. Data modeling i.e. applying business rules to aggregate up event-level data into a format suitable for ingesting into a business intelligence / reporting / OLAP tool
  2. Real-time aggregation of data for real-time dashboards
  3. Running machine-learning algorithms on event-level data

We’re just at the beginning of our journey getting familiar with Apache Spark. I’ve been using Spark for the first time over the past few weeks. In this post I’ll share back with the community what I’ve learnt, and will cover:

  1. Loading Snowplow data into Spark
  2. Performing simple aggregations on Snowplow data in Spark
  3. Performing funnel analysis on Snowplow data

I’ve tried to write the post in a way that’s easy to follow-along for other people interested in getting up the Spark learning curve.
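For readers new to the third item, funnel analysis, here is a small sketch of the idea in plain Python rather than Spark: for each user, walk their events in time order and count how far down an ordered list of steps they get. The event names and the (user, timestamp, event) tuple layout are hypothetical, not Snowplow's event schema.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count users reaching each step, requiring steps to occur in order.

    events: iterable of (user_id, timestamp, event_name) tuples
    steps:  ordered list of event names that make up the funnel
    """
    by_user = defaultdict(list)
    for user, ts, name in events:
        by_user[user].append((ts, name))

    counts = [0] * len(steps)
    for history in by_user.values():
        next_step = 0
        for _, name in sorted(history):  # time order within one user
            if next_step < len(steps) and name == steps[next_step]:
                counts[next_step] += 1
                next_step += 1
    return dict(zip(steps, counts))


# Made-up event-level data.
events = [
    ("u1", 1, "page_view"), ("u1", 2, "add_to_basket"), ("u1", 3, "purchase"),
    ("u2", 1, "page_view"), ("u2", 2, "add_to_basket"),
    ("u3", 1, "page_view"),
    ("u4", 1, "add_to_basket"), ("u4", 2, "page_view"),  # out of order
]

print(funnel_counts(events, ["page_view", "add_to_basket", "purchase"]))
```

User u4 added to basket before any page view, so only the page view counts toward the funnel. In Spark the same grouping by user and ordered scan would be spread across the cluster, but the logic is no more complicated than this.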

What a great post to find just before the weekend!

You will enjoy this one and others in this series.

Have you ever considered aggregating into a business dashboard what is known about particular subjects? We have all seen the dashboards with increasing counts, graphs, charts, etc. but what about non-tabular data?

A non-tabular dashboard?

Rosetta’s Way Back to the Source

Friday, May 22nd, 2015

Rosetta’s Way Back to the Source – Towards Reverse Engineering of Complex Software by Herbert Bos.

From the webpage:

The Rosetta project, funded by the EU in the form of an ERC grant, aims to develop techniques to enable reverse engineering of complex software that is available only in binary form. To the best of our knowledge we are the first to start working on a comprehensive and realistic solution for recovering the data structures in binary programs (which is essential for reverse engineering), as well as techniques to recover the code. The main success criterion for the project will be our ability to reverse engineer a realistic, complex binary. Additionally, we will show the immediate usefulness of the information that we extract from the binary code (that is, even before full reverse engineering), by automatically hardening the software to make it resilient against memory corruption bugs (and attacks that exploit them).

In the Rosetta project, we target common processors like the x86, and languages like C and C++ that are difficult to reverse engineer, and we aim for full reverse engineering rather than just decompilation (which typically leaves out data structures and semantics). However, we do not necessarily aim for fully automated reverse engineering (which may well be impossible in the general case). Rather, we aim for techniques that make the process straightforward. In short, we will push reverse engineering towards ever more complex programs.

Our methodology revolves around recovering data structures, code and semantic information iteratively. Specifically, we will recover data structures not so much by statically looking at the instructions in the binary program (as others have done), but mainly by observing how the data is used

Research question. The project addresses the question whether the compilation process that translates source code to binary code is irreversible for complex software. Irreversibility of compilation is an assumed property that underlies most of the commercial software today. Specifically, the project aims to demonstrate that the assumption is false.
… (emphasis added)

Herbert gives a great thumbnail sketch of the difficulties and potential for this project.

Looking forward to news of a demonstration that “irreversibility of compilation” is false.

One important use case is verifying that software which claims to use buffer overflow prevention techniques has in fact done so. Not the sort of thing I would entrust to statements in marketing materials.