Archive for March, 2014

OpenAccessReader

Monday, March 31st, 2014

OpenAccessReader

From the webpage:

Open Access Reader is a project to systematically ensure that all significant open access research is cited in Wikipedia.

There’s lots of great research being published in good quality open access journals that isn’t cited in Wikipedia. It’s peer reviewed, so it should count as a reliable source. It’s available for anyone to read and probably comes with pretty decent metadata too. Can we set up a process to make it super convenient for editors to find and cite these papers?

If you are looking for a project with the potential to make a real difference this year, check this one out.

They are looking for volunteers.

150 Million Topics

Monday, March 31st, 2014

150 Million More Reasons to Love Bing Everyday by Richard Qian.

From the post:

At Bing, we understand that search is more than simply finding information and browsing a collection of blue links pointing to pages around the web. We’ve talked about doing instead of searching and how Bing continues to expand its approach to understand the actual world around us.

Today, you’ll see this come to life on Bing.com in a feature called Snapshot. Snapshot brings together information that you need at a glance, with rich connections to deeper information on the people, places, and things you care about made possible by our deep understanding of the real world. To accomplish this, Bing now tracks billions of entities and perhaps more importantly, the billions of relationships between them, all to get you the right data instantly while you search.

New Entities: Introducing Doctors, Lawyers, Dentists and Real Estate Properties
….

In case you are interested, the “Snapshot” is what ISO/IEC 13250 (Dec., 1999) defined as a topic.

topic: An aggregate of topic characteristics, including zero or more names, occurrences, and roles played in association with other topics, whose organizing principle is a single subject.

Unlike the topic maps effort, Bing conceals all the ugliness that underlies merging of information and delivers to users an immediately useful and consumable information product.

But also unlike the topic maps effort, Bing is about as useful as a wedgie when you are looking for information internal to your enterprise.

Why?

Mostly because the subjects internal to your organization don’t get mapped by Bing.

Can you guess who is going to have to do that work? Got a mirror handy?

Use Bing’s Snapshots to see what collated information can look like. Decide if that’s a look you want for all or some of your information.

Hard to turn down free advertising/marketing from MS. It happens so rarely.

Thanks MS!

Speaking of Not Lying

Monday, March 31st, 2014

Financial Institutions Leverage Metadata Driven Modeling Capability Built on the Oracle R Enterprise Platform to Accelerate Model Deployment and Streamline Governance

Oracle released a buzzword-laden announcement on 25 March 2014 that anticipated Rebekah Campbell’s post on lying by saying:

Financial institutions continue to expand their use of statistical models across the enterprise. In addition to their long-standing role in risk management, models are increasingly the foundation for customer insight and marketing, financial crime and compliance and enterprise performance management analytical applications. As a result, organizations are spending more time and resources creating and validating models, improving data quality, verifying results and managing and governing the use of models across the enterprise. (emphasis added)

I have never seen a vendor advertise their software as being useful for financial crime.

Do you know what sort of EULA that software ships under?

BTW, what other parts of this announcement seem a bit shaded to you?

I first saw this in a tweet by Bob DuCharme.

The High Cost of Lying

Monday, March 31st, 2014

The Surprisingly Large Cost of Telling Small Lies by Rebekah Campbell.

From the post:

Recently, I caught up with one of our angel investors for lunch: Peter is a brilliant entrepreneur from England who has lived all over the world. He has built several businesses and now lives a dream life with a house on a harbor, a happy family and a broad smile.

As our conversation drifted from an update of my company to a deep discussion about life itself, I asked him what he thought was the secret to success. I expected the standard “never give up” or some other T-shirt slogan, but what he said took me by surprise. “The secret to success in business and in life is to never, ever, ever tell a lie,” he said.

That stumped me. I know that lying is bad and telling the truth is good — we learn that as children. But the secret to success? I looked at Peter, confused and skeptical. He nodded and assured me, “Complete honesty is the access to ultimate power.”

As we spoke, I started thinking about the little lies I tell every day — often without thinking about it, but not always. I have been guilty of exaggerating a metric here or there or omitting facts for my own advantage. Each time, there is a little voice inside my head that tells me it is the wrong thing to do. I have wondered whether everyone does this or whether it is just me. Could this be what has been holding me back?

I did some research and it seems most of us lie quite a bit. A study by the University of Massachusetts found that 60 percent of adults could not have a 10-minute conversation without lying at least once. The same study found that 40 percent of people lie on their résumés and a whopping 90 percent of those looking for a date online lie on their profiles. Teenage girls lie more than any other group, which is attributed to peer pressure and expectation. The study did not investigate the number of lies told by entrepreneurs looking for investment capital, but I fear we would top the chart.

We all need to read Rebekah’s post at least once a month, if not more often.

What really annoys me are techno lies: you ask about one issue and the response is a lot of bluff and bluster about how the questioner doesn’t understand the technology, community, some unspecified requirements, etc.

When I get that response, I know I am being lied to. If the person had a real answer, they would not have a stock paragraph that keeps repeating the careful consideration some group made of the question at some unspecified time.

They would just say: sorry, here are the facts (a short list) and this is why X works this way. Quite simple.

BTW, there is a side-effect (sorry functional programming fans) to not lying: You don’t have to remember what lie you told to whom in what context. Greatly reduces the amount of clutter that you have to remember.

At least if you want to be a successful liar. I would rather be successful at something else.

PS: Would you consider closed source software that was compromised to spy on you as lying? As in lying to a customer? I would too.

Parsing Drug Dosages in text…

Sunday, March 30th, 2014

Parsing Drug Dosages in text using Finite State Machines by Sujit Pal.

From the post:

Someone recently pointed out an issue with the Drug Dosage FSM in Apache cTakes on the cTakes mailing list. Looking at the code for it revealed a fairly complex implementation based on a hierarchy of Finite State Machines (FSM). The intuition behind the implementation is that Drug Dosage text in doctor’s notes tend to follow a standard-ish format, and FSMs can be used to exploit this structure and pull out relevant entities out of this text. The paper Extracting Structured Medication Event Information from Discharge Summaries has more information about this problem. The authors provide their own solution, called the Merki Medication Parser. Here is a link to their Online Demo and source code (Perl).

I’ve never used FSMs myself, although I have seen it used to model (more structured) systems. So the idea of using FSMs for parsing semi-structured text such as this seemed interesting and I decided to try it out myself. The implementation I describe here is nowhere nearly as complex as the one in cTakes, but on the flip side, is neither as accurate, nor broad nor bulletproof either.

My solution uses drug dosage phrase data provided in this Pattern Matching article by Erin Rhode (which also comes with a Perl based solution), as well as its dictionaries (with additions by me), to model the phrases with the state diagram below. I built the diagram by eyeballing the outputs from Erin Rhode’s program. I then implement the state diagram with a home-grown FSM implementation based on ideas from Electric Monk’s post on FSMs in Python and the documentation for the Java library Tungsten FSM. I initially tried to use Tungsten-FSM, but ended up with extremely verbose Scala code because of Scala’s stricter generics system.

This caught my attention because I was looking at a data import handler recently that was harvesting information from a minimal XML wrapper around mediawiki markup. Works quite well but seems like a shame to miss all the data in wiki markup.

I say “miss all the data in wiki markup” and that’s not really fair. It is dumped into a single field for indexing. But that is a field that loses the context distinctions between a note, appendix, bibliography, or even the main text.

If you need distinctions that aren’t the defaults, you may be faced with rolling your own FSM. This post should help get you started.
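To give a rough idea of what rolling your own looks like, here is a minimal sketch of a table-driven FSM in Python. The state names, token classes, and tiny dictionaries are illustrative assumptions; this is not the state diagram from Sujit Pal's post, nor the cTakes implementation.

```python
import re

# Hypothetical dictionaries; a real system would use much larger ones.
UNITS = {"mg", "ml", "mcg", "g"}
FREQUENCIES = {"daily", "bid", "tid", "qid", "prn"}


def classify(token):
    """Map a raw token to a coarse token class."""
    t = token.lower()
    if re.fullmatch(r"\d+(\.\d+)?", t):
        return "NUMBER"
    if t in UNITS:
        return "UNIT"
    if t in FREQUENCIES:
        return "FREQ"
    return "WORD"


# Transition table: (current state, token class) -> next state.
TRANSITIONS = {
    ("START", "WORD"): "DRUG",
    ("DRUG", "WORD"): "DRUG",    # multi-word drug names
    ("DRUG", "NUMBER"): "DOSE",
    ("DOSE", "UNIT"): "UNIT",
    ("UNIT", "FREQ"): "FREQ",
}

ACCEPTING = {"UNIT", "FREQ"}


def parse_dosage(text):
    """Run the FSM over whitespace tokens; return captured fields or None."""
    state = "START"
    fields = {"DRUG": [], "DOSE": [], "UNIT": [], "FREQ": []}
    for token in text.split():
        state = TRANSITIONS.get((state, classify(token)))
        if state is None:
            return None          # no valid transition: phrase rejected
        fields[state].append(token)
    if state not in ACCEPTING:
        return None              # phrase ended before a complete dosage
    return {name: " ".join(tokens) for name, tokens in fields.items()}
```

The same shape works for wiki markup: swap the token classes for markup events (heading, reference, note) and let the current state tell you which field a piece of text belongs in.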

Is That An “Entity” On Your Webpage?

Sunday, March 30th, 2014

How To Tell Search Engines What “Entities” Are On Your Web Pages by Barbara Starr.

From the post:

Search engines have increasingly been incorporating elements of semantic search to improve some aspect of the search experience — for example, using schema.org markup to create enhanced displays in SERPs (as in Google’s rich snippets).

Elements of semantic search are now present at almost all stages of the search process, and the Semantic Web has played a key role. Read on for more detail and to learn how to take advantage of this opportunity to make your web pages more visible in this evolution of search.


The identifications are fairly coarse; that is, you get a pointer (URL) that identifies a subject but no idea why someone picked that URL.

But, we all know how well coarse pointers, document level pointers, have worked for the WWW.

Kinda surprising because we have had sub-document indexing for centuries.

Odd how simply pointing to a text blob suddenly became acceptable.

Think of the efforts by Google and schema.org as an attempt to recover indexing as it existed in the centuries before the advent of the WWW.
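For a sense of how coarse such an identification is, here is a small sketch that builds a schema.org description as JSON-LD. The helper name and the example values are hypothetical; `@context`, `@type`, `name`, and `sameAs` are ordinary schema.org/JSON-LD terms.

```python
import json


def entity_jsonld(name, entity_type, same_as):
    """Serialize a minimal schema.org entity description as JSON-LD."""
    description = {
        "@context": "https://schema.org",
        "@type": entity_type,
        "name": name,
        "sameAs": same_as,   # the URL "pointer"; nothing records why it was chosen
    }
    return json.dumps(description, indent=2)
```

Calling `entity_jsonld("Example Film", "Movie", "http://example.org/film")` yields a block that could be embedded in a page; note that the markup carries the pointer and nothing about the reasoning behind it.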

The Theoretical Astrophysical Observatory:…

Sunday, March 30th, 2014

The Theoretical Astrophysical Observatory: Cloud-Based Mock Galaxy Catalogues by Maksym Bernyk, et al.

Abstract:

We introduce the Theoretical Astrophysical Observatory (TAO), an online virtual laboratory that houses mock observations of galaxy survey data. Such mocks have become an integral part of the modern analysis pipeline. However, building them requires an expert knowledge of galaxy modelling and simulation techniques, significant investment in software development, and access to high performance computing. These requirements make it difficult for a small research team or individual to quickly build a mock catalogue suited to their needs. To address this TAO offers access to multiple cosmological simulations and semi-analytic galaxy formation models from an intuitive and clean web interface. Results can be funnelled through science modules and sent to a dedicated supercomputer for further processing and manipulation. These modules include the ability to (1) construct custom observer light-cones from the simulation data cubes; (2) generate the stellar emission from star formation histories, apply dust extinction, and compute absolute and/or apparent magnitudes; and (3) produce mock images of the sky. All of TAO’s features can be accessed without any programming requirements. The modular nature of TAO opens it up for further expansion in the future.

The website: Theoretical Astrophysical Observatory.

While disciplines in the sciences and the humanities play access games with data and publications, the astronomy community continues to shame both of them.

Funders, both government and private, should take a common approach: Open and unfettered access to data or no funding.

It’s just that simple.

If grantees object, they can try to function without funding.

Accessible Government vs. Open Government

Sunday, March 30th, 2014

Congressional Officials Grant Access Due To Campaign Contributions: A Randomized Field Experiment

Abstract:

Concern that lawmakers grant preferential treatment to individuals because they have contributed to political campaigns has long occupied jurists, scholars, and the public. However, the effects of campaign contributions on legislators’ behavior have proven notoriously difficult to assess. We report the first randomized field experiment on the topic. In the experiment, a political organization attempted to schedule meetings between 191 Members of Congress and their constituents who had contributed to political campaigns. However, the organization randomly assigned whether it informed legislators’ offices that individuals who would attend the meetings were contributors. Congressional offices made considerably more senior officials available for meetings when offices were informed the attendees were donors, with senior officials attending such meetings more than three times as often (p < 0.01). Influential policymakers thus appear to make themselves much more accessible to individuals because they have contributed to campaigns, even in the absence of quid pro quo arrangements. These findings have significant implications for ongoing legal and legislative debates. The hypothesis that individuals can command greater attention from influential policymakers by contributing to campaigns has been among the most contested explanations for how financial resources translate into political power. The simple but revealing experiment presented here elevates this hypothesis from extensively contested to scientifically supported.

Donors really are different from the rest of us: they have access.

One hopes the next randomized experiment distinguishes where the break points are in donations.

I suspect < $500 is one group, $500 - $1,000 is the second group, $1,000 - $2,500 is the third group and so on. Just guesses on my part but it would help the political process if potential donors had a bidding sheet for candidates. You don’t want to appear foolish and pay too much for access to a junior member of Congress but on the other hand, you don’t want to insult a senior member with too small a donation. Think of it as transparency of access.

I first saw this at Full Text Reports.

mtx:…

Saturday, March 29th, 2014

mtx: a swiss-army knife for information retrieval

From the webpage:

mtx is a command-line tool for rapidly trying new ideas in Information Retrieval and Machine Learning.

mtx is the right tool if you secretly wish you could:

  • play with Wikipedia-sized datasets on your laptop
  • do it interactively, like the boys whose data fits in Matlab
  • quickly test that too-good-to-be-true algorithm you see at SIGIR
  • try ungodly concoctions, like BM25-weighted PageRank over ratings
  • cache all intermediate results, so you never have to re-run a month-long job
  • use awk/perl to hack internal data structures half-way through a computation

mtx is made for Unix hackers. It is a shell tool, not a library or an application. It’s designed for interactive use and relies on your shell’s tab-completion and history features. For scripting it, I highly recommend this.

What do you have on your bootable USB stick? 😉

Installing Apache Solr 4.7 multicore…

Saturday, March 29th, 2014

Installing Apache Solr 4.7 multicore on Ubuntu 12.04 and Tomcat7

From the post:

I will show you how to install the ApacheSolr search engine under Tomcat7 servlet container on Ubuntu 12.04.4 LTS (Precise Pangolin) to be used later with Drupal 7. In this writeup I’m gonna discuss only the installation and setup of the ApacheSolr server. Specific Drupal configuration and/or Drupal side configuration to be discussed in future writeup.

Nothing you don’t already know but a nice checklist for the installation.

I’m glad I found it because I am writing a VM script to auto-install Solr as part of a VM distribution.

Manually I do ok but am likely to forget something the script needs explicitly.

Biggest source of DOD’s cyber threats: inept co-workers

Friday, March 28th, 2014

Biggest source of DOD’s cyber threats: inept co-workers by Kevin McCaney.

From the post:

Defense Department IT professionals are nearly as concerned about internal threats as they are external hacking of their networks — and most concerned about careless or poorly trained insiders as a source of threats, according to a recent survey by SolarWinds, an IT management software provider.

In the survey, which addressed cybersecurity threats and preparedness across the federal government, 41 percent of DOD respondents named insider data leakage/theft as a threat, not far below the 48 percent who identified external hacking.

And although those responses may have come with the disclosures of Edward Snowden and Chelsea Manning in mind, it seems inept co-workers, rather than intentional leakers, are the biggest concern. Fifty-three percent of DOD respondents cited careless/untrained insiders as a source of security threats, more than foreign governments (48 percent), terrorists (31 percent) or the general hacking community (35 percent). Malicious insiders weren’t left out, however, being cited by 26 percent of respondents.

At first blush, this post seems to support the Torkington Conjecture I posted about recently: that “stupid” users are the cause of computer security woes.

Actually, if computer systems were designed with security in mind, even “stupid” users would not be the source of security breaches.

For example, take the classic case of a user posting their passwords on sticky notes to their monitor. Very, very bad practice. Yes?

OK, but if the network is configured to allow access by that user during specified hours and only from their computer, what do you think the odds are of an unknown hacker sitting at their computer trying to hack the system?
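A minimal sketch of that kind of restriction, assuming a hypothetical per-user rule with one workstation and one daily time window (the class, field, and function names are illustrative, not any particular product's configuration):

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessRule:
    """Hypothetical policy: one user, one workstation, one time window."""
    user: str
    workstation: str
    start: time
    end: time


def is_login_allowed(rule, user, workstation, at):
    """Allow a login only for the right user, machine, and time of day."""
    return (
        user == rule.user
        and workstation == rule.workstation
        and rule.start <= at <= rule.end
    )
```

With a rule such as `AccessRule("alice", "ws-17", time(8, 0), time(18, 0))`, a password copied off a sticky note is of little use from any other machine or outside working hours.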

If you don’t plan for security, it should come as no great surprise that you have no security.

Clojure/West 2014 Videos

Friday, March 28th, 2014

I first saw this at: Clojure/West 2014 Presentations : A Wonderful Stack Now Available by Charles Ditzel.

I cleaned up the listing at YouTube and put the authors in order by last name.

Don’t you think it is odd that YouTube has such poor sorting options?

Or am I missing a “secret” button somewhere?

A Practical Optional Type System for Clojure [Types for Topic Maps?]

Friday, March 28th, 2014

A Practical Optional Type System for Clojure by Ambrose Bonnaire-Sergeant.

Abstract:

Dynamic programming languages often abandon the advantages of static type checking in favour of their characteristic convenience and flexibility. Static type checking eliminates many common user errors at compile-time that are otherwise unnoticed, or are caught later in languages without static type checking. A recent trend is to aim to combine the advantages of both kinds of languages by adding optional static type systems to languages without static type checking, while preserving the idioms and style of the language.

This dissertation describes my work on designing an optional static type system for the Clojure programming language, a dynamically typed dialect of Lisp, based on the lessons learnt from several projects, primarily Typed Racket. This work includes designing and building a type checker for Clojure running on the Java Virtual Machine. Several experiments are conducted using this prototype, particularly involving existing Clojure code that is sufficiently complicated that type checking increases confidence that the code is correct. For example, nearly all of algo.monads, a Clojure Contrib library for monadic programming, is able to be type checked. Most monad, monad transformer, and monadic function definitions can be type checked, usually by adding type annotations in natural places like function definitions.

There is significant future work to fully type check all Clojure features and idioms. For example, multimethod definitions and functions with particular constraints on the number of variable arguments they accept (particularly functions taking only an even number of variable arguments) are troublesome. Also, there are desirable features from the Typed Racket project that are missing, such as automatic runtime contract generation and a sophisticated blame system, both which are designed to improve error messages when mixing typed and untyped code in similar systems.

Overall, the work described in this dissertation leads to the conclusion that it appears to be both practical and useful to design and implement an optional static type system for the Clojure programming language.

Information retrieval that relies upon merging representatives of the same subject would benefit from type checking.

In XTM we rely upon string equivalence of URIs for merging of topics. Leaves you with visual inspection to audit merging.

I could put:

http://www.durusau.net/general/background.html, and

http://en.wikipedia.org/wiki/Lion_king

as subject identifiers in a single topic and a standard XTM processor would merrily merge topics with those subject identifiers together.

Recalling the rules of the TMDM to be:

Equality rule:

Two topic items are equal if they have:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.
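The rule above can be sketched in a few lines of Python. This is a simplified illustration, not a conforming TMDM processor: the `Topic` class is a hypothetical stand-in that keeps only the four properties named in the list.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Simplified topic item; property names follow the list above."""
    subject_identifiers: set = field(default_factory=set)
    item_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)
    reified: object = None


def topics_equal(a, b):
    """Return True if any of the equality conditions listed above holds."""
    return bool(
        a.subject_identifiers & b.subject_identifiers
        or a.item_identifiers & b.item_identifiers
        or a.subject_locators & b.subject_locators
        or a.subject_identifiers & b.item_identifiers
        or a.item_identifiers & b.subject_identifiers
        or (a.reified is not None and a.reified is b.reified)
    )
```

Every test is a plain string (or object identity) comparison. Nothing asks whether two strings on one topic identify the same subject, which is why a topic carrying both URIs shown earlier merges with anything that shares either one.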

Adding data types to subject identifiers could alert authors to merge errors long before they may or may not be discovered by users.

Otherwise merge errors in topic maps may lie undetected and uncorrected for some indeterminate period of time. (Sounds like software bugs, doesn’t it?)
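One way such a check might look, as a sketch only: assume each subject identifier carries an author-declared type label (the labels and the function name here are hypothetical, and nothing like this is part of XTM or the TMDM), and flag any topic whose identifiers disagree.

```python
def conflicting_types(typed_identifiers):
    """Given {identifier: declared_type}, return the types if they disagree.

    An empty set means all identifiers on the topic share one declared type.
    """
    types = set(typed_identifiers.values())
    return types if len(types) > 1 else set()
```

A topic whose identifiers were declared as, say, `"person"` and `"film"` would be reported at authoring time rather than discovered by a puzzled user after the merge.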

I first saw this in a tweet by mrb.

LVars:…

Thursday, March 27th, 2014

LVars: Lattice-based Data Structures for Deterministic Parallel and Distributed Programming by Lindsey Kuper.

At 144 slides and no sound, you probably need to pick up some background to really appreciate the slides.

I would start with: A ten-minute talk about my research, continue with later posts under LVars, and then move on to:

LVars project repo: http://github.com/iu-parfunc/lvars

Code from this talk: http://github.com/lkuper/lvar-examples

Research blog: http://composition.al

Take up the slides when you feel comfortable with the nomenclature and basic concepts.

WebScaleSQL

Thursday, March 27th, 2014

WebScaleSQL

From the webpage:

What is WebScaleSQL?

WebScaleSQL is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale, and seek greater performance from a database technology tailored for their needs.

Our goal in launching WebScaleSQL is to enable the scale-oriented members of the MySQL community to work more closely together in order to prioritize the aspects that are most important to us. We aim to create a more integrated system of knowledge-sharing to help companies leverage the great features already found in MySQL 5.6, while building and adding more features that are specific to deployments in large scale environments. In the last few months, engineers from all four companies have contributed code and provided feedback to each other to develop a new, more unified, and more collaborative branch of MySQL.

But as effective as this collaboration has been so far, we know we’re not the only ones who are trying to solve these particular challenges. So we will keep WebScaleSQL open as we go, to encourage others who have the scale and resources to customize MySQL to join in our efforts. And of course we will welcome input from anyone who wants to contribute, regardless of what they are currently working on.

Who is behind WebScaleSQL?

WebScaleSQL currently includes contributions from MySQL engineering teams at Facebook, Google, LinkedIn, and Twitter. Together, we are working to share a common base of code changes to the upstream MySQL branch that we can all use and that will be made available via open source. This collaboration will expand on existing work by the MySQL community, and we will continue to track the upstream branch that is the latest, production-ready release (currently MySQL 5.6).

Correct me if I’m wrong but don’t teams from Facebook, Google, LinkedIn and Twitter know a graph when they see one? 😉

Even people who recognize graphs may need an SQL solution every now and again. Besides, solutions should not drive IT policy.

Requirements and meeting those requirements should drive IT policy. You are less likely to own very popular, expensive and ineffectual solutions when requirements rule. (Even iterative requirements in the agile approach are requirements.)

A reminder that MySQL/WebScaleSQL compiles from source with:

A working ANSI C++ compiler. GCC 4.2.1 or later, Sun Studio 10 or later, Visual Studio 2008 or later, and many current vendor-supplied compilers are known to work. (INSTALL-SOURCE)

Which makes it a target, sorry, subject for analysis of any vulnerabilities with joern.

I first saw this in a post by Derrick Harris, Facebook — with help from Google, LinkedIn, Twitter — releases MySQL built to scale.

Apache Mahout, “…Ya Gotta Hit The Road”

Thursday, March 27th, 2014

The news in Derrick Harris’ “Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce” reminded me of a line from Tommy, “Just as the gypsy queen must do, ya gotta hit the road.”

From the post:

Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.

While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.

H2O was developed separately by a startup called 0xadata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of statistical computations — including deep learning models — on data stored in the Hadoop Distributed File System.

Support for multiple data frameworks is yet another reason to learn Mahout.

1939 Register

Thursday, March 27th, 2014

1939 Register

From the webpage:

The 1939 Register is being digitised and will be published within the next two years.

It will provide valuable information about over 30 million people living in England and Wales at the start of World War Two.

What is the 1939 Register?

The British government took a record of the civilian population shortly after the outbreak of World War Two. The information was used to issue identity cards and organise rationing. It was also used to set up the National Health Service.

Explanations are one of the perils of picking very obvious/intuitive names for projects. 😉

The data should include:

Data will be provided only where the individual is recorded as deceased (or where clear evidence of death can be provided by the applicant) and will include;

  • National Registration number
  • Address
  • Surname
  • First Forename
  • Other Forename(s)/Initial(s)
  • Date of Birth
  • Sex
  • Marital Status
  • Occupation

As per the 1939 Register Service, a government office that charges money to search what one assumes are analog records. (Yikes!)

The reason I mention the 1939 Register Service is the statement:

Is any other data available?

If you wish to request additional information under the Freedom of Information Act 2000, please email enquiries@hscic.gov.uk or contact us using the postal address below, marking the letter for the Higher Information Governance Officer (Southport).

Which implies to me there is more data to be had, but the 1911Census.org.uk says not.

Well, assuming you don’t include:

“If member of armed forces or reserves,” which was column G on the original form.

Hard to say why that would be omitted.

It will be interesting to see if the original and then “updated” cards are digitized.

In some of the background reading I did on this data, some mothers omitted their sons from the registration cards (one assumes to avoid military service) but when rationing began based on the registration cards, they filed updated cards to include their sons.

I suspect the 1939 data will be mostly of historical interest but wanted to mention it because people will be interested in it.

CSV on the Web

Thursday, March 27th, 2014

CSV on the Web Use Cases and Requirements, and Model for Tabular Data and Metadata Published

I swear, that really is the title.

Two recent drafts of interest:

The CSV on the Web: Use Cases and Requirements collects use cases that are at the basis of the work of the Working Group. A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. The Working Group aim to specify technologies that provide greater interoperability for data dependent applications on the Web when working with tabular datasets comprising single or multiple files using CSV, or similar, format. This document lists a first set of use cases compiled by the Working Group that are considered representative of how tabular data is commonly used within data dependent applications. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies. This document also provides a first set of requirements derived from these use cases that have been used to guide the specification design.

The Model for Tabular Data and Metadata on the Web outlines a basic data model, or infoset, for tabular data and metadata about that tabular data. The document contains first drafts for various methods of locating metadata: one of the output the Working Group is chartered for is to produce a metadata vocabulary and standard method(s) to find such metadata. It also contains some non-normative information about a best practice syntax for tabular data, for mapping into that data model, to contribute to the standardisation of CSV syntax by IETF (as a possible update of RFC4180).

I guess they mean to use CSV as it exists? What a radical concept. 😉

What next?

Could use an updated specification for the COBOL data format in which many government data sets are published (even now).

That last statement isn’t entirely in jest. There are a lot of COBOL-formatted files on government websites in particular.
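Such files are typically fixed-width records whose layout lives in a separate copybook, which is exactly the kind of out-of-band metadata the CSV drafts are trying to standardize. A minimal sketch of reading one record, assuming a made-up three-field layout (the field names, offsets, and lengths are hypothetical):

```python
# Hypothetical layout: (field name, start offset, length), as a copybook might describe.
LAYOUT = [("id", 0, 5), ("name", 5, 10), ("amount", 15, 7)]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into named, whitespace-trimmed fields."""
    return {
        name: line[start:start + length].strip()
        for name, start, length in layout
    }
```

Without the layout, the record is just a run of characters; with it, the file is as tabular as any CSV.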

Modeling and Discovering Vulnerabilities…

Thursday, March 27th, 2014

Modeling and Discovering Vulnerabilities with Code Property Graphs by Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck.

Abstract:

The vast majority of security breaches encountered today are a direct result of insecure code. Consequently, the protection of computer systems critically depends on the rigorous identification of vulnerabilities in software, a tedious and error-prone process requiring significant expertise. Unfortunately, a single flaw suffices to undermine the security of a system and thus the sheer amount of code to audit plays into the attacker’s cards. In this paper, we present a method to effectively mine large amounts of source code for vulnerabilities. To this end, we introduce a novel representation of source code called a code property graph that merges concepts of classic program analysis, namely abstract syntax trees, control flow graphs and program dependence graphs, into a joint data structure. This comprehensive representation enables us to elegantly model templates for common vulnerabilities with graph traversals that, for instance, can identify buffer overflows, integer overflows, format string vulnerabilities, or memory disclosures. We implement our approach using a popular graph database and demonstrate its efficacy by identifying 18 previously unknown vulnerabilities in the source code of the Linux kernel.

I was running down references in the documentation for joern when I discovered this paper.

The recent SSH bug in the Apple iOS is used to demonstrate a code property graph that combines the perspectives of Abstract Syntax Trees, Control Flow Graphs, and Program Dependence Graphs.

In topic map lingo we would call those “universes of discourse,” but the essential fact to remember is that combining different perspectives (are you listening NSA?) is where a code property graph derives its power.

Note that I said “combining” (different perspectives are preserved) not “sanitizing” (different perspectives are lost).

Using Neo4j, the authors created a code property graph of the Linux kernel, 52 million nodes and 87 million edges. As a result of their analysis, they discovered 18 previously undiscovered bugs.
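
To make the "combining, not sanitizing" point concrete, here is a minimal sketch of my own, not the authors' implementation and not joern's schema. It keeps syntax, control flow, and data dependence as differently labeled edges in one small in-memory property graph. The toy program, the node ids, the edge labels (AST, CFG, DATA_DEP), and the function names `read_len` and `copy` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph: nodes carry properties, edges carry a label."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # source id -> [(label, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def out(self, node_id, label):
        """Follow outgoing edges that carry the given label."""
        return [dst for (lbl, dst) in self.edges[node_id] if lbl == label]


def build_example():
    # Hypothetical toy program:
    #   s1: n = read_len()
    #   s2: buf = alloc(16)
    #   s3: copy(buf, src, n)
    g = PropertyGraph()
    g.add_node("s1", kind="assignment", code="n = read_len()")
    g.add_node("s2", kind="assignment", code="buf = alloc(16)")
    g.add_node("s3", kind="call", code="copy(buf, src, n)", callee="copy")
    g.add_node("c1", kind="call", code="read_len()", callee="read_len")
    # Syntax perspective: statement s1 contains the call c1.
    g.add_edge("s1", "AST", "c1")
    # Control flow perspective: statement order.
    g.add_edge("s1", "CFG", "s2")
    g.add_edge("s2", "CFG", "s3")
    # Data dependence perspective: s3 uses n (from s1) and buf (from s2).
    g.add_edge("s1", "DATA_DEP", "s3")
    g.add_edge("s2", "DATA_DEP", "s3")
    return g


def tainted_copies(g, source_callee="read_len", sink_callee="copy"):
    """Find sink calls that depend on a statement containing a source call.

    The traversal crosses two perspectives: AST edges to see what a
    statement contains, DATA_DEP edges to see where its value flows.
    """
    hits = []
    for node_id in list(g.nodes):
        children = g.out(node_id, "AST")
        if not any(g.nodes[c].get("callee") == source_callee for c in children):
            continue
        for dep in g.out(node_id, "DATA_DEP"):
            if g.nodes[dep].get("callee") == sink_callee:
                hits.append((node_id, dep))
    return hits
```

Neither perspective alone answers the question. The syntax edges know that `s1` calls `read_len`, the dependence edges know that `s1` feeds `s3`, and only the merged graph can say both in one traversal.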

Important: Patterns discovered in a code property graph can be used to identify vulnerabilities in other source code. Searching for bugs in source code can become cumulative and less episodic.

Comparison of source and bug histories of the Linux kernel, Apache http server, Sendmail, etc. will provide some of the common graph patterns signaling vulnerabilities in source code.

Will the white or black hat community be the first to build a public repository for graph patterns showing source code vulnerabilities?

Hiding security information hasn’t worked so far but I think you know the most likely result.

joern

Wednesday, March 26th, 2014

joern

From the webpage:

Source code analysis is full of graphs: abstract syntax trees, control flow graphs, call graphs, program dependency graphs and directory structures, to name a few. Joern analyzes a code base using a robust parser for C/C++ and represents the entire code base by one large property graph stored in a Neo4J graph database. This allows code to be mined using complex queries formulated in the graph traversal languages Gremlin and Cypher.

The documentation can be found here

This looks quite useful.

Makes me curious about mapping together the graphs of different codebases that share libraries.

I found this following a tweet by Nicolas Karassas which pointed to: Hunting Vulnerabilities with Graph Databases by Fabian Yamaguchi.

Elasticsearch 1.1.0,…

Wednesday, March 26th, 2014

Elasticsearch 1.1.0, 1.0.2 and 0.90.13 released by Clinton Gormley.

From the post:

Today we are happy to announce the release of Elasticsearch 1.1.0, based on Lucene 4.7, along with bug fix releases Elasticsearch 1.0.2 and Elasticsearch 0.90.13:

You can download them and read the full changes list here:

New features in 1.1.0

Elasticsearch 1.1.0 is packed with new features: better multi-field search, the search templates and the ability to create aliases when creating an index manually or with a template. In particular, the new aggregations framework has enabled us to support more advanced analytics: the cardinality agg for counting unique values, the significant_terms agg for finding uncommonly common terms, and the percentiles agg for understanding data distribution.

We will be blogging about all of these new features in more detail, but for now we’ll give you a taste of what each feature adds:

….

Well, there goes the rest of the week! 😉
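
If you want a head start, here is a rough sketch of a search request body that asks for the three aggregations named in the announcement. The aggregation type names (cardinality, significant_terms, percentiles) come from the post. The field names, the labels I give each aggregation, and the helper function are my own assumptions, so check the request shape against the Elasticsearch documentation for your version before relying on it.

```python
import json


def build_analytics_query(user_field, text_field, latency_field):
    """Build a search body with one aggregation of each kind.

    The keys "unique_users", "unusual_terms" and "latency_spread" are
    arbitrary labels chosen for this sketch, not Elasticsearch keywords.
    """
    return {
        "aggs": {
            # Approximate count of distinct values in a field.
            "unique_users": {"cardinality": {"field": user_field}},
            # Terms that are unusually frequent in the matching documents.
            "unusual_terms": {"significant_terms": {"field": text_field}},
            # Distribution of a numeric field.
            "latency_spread": {"percentiles": {"field": latency_field}},
        }
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    body = build_analytics_query("user_id", "tags", "load_time_ms")
    print(json.dumps(body, indent=2))
```

The printed JSON is what you would send as the body of a search request. Nothing here talks to a cluster.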

MOOCs and courses to learn R

Wednesday, March 26th, 2014

MOOCs and courses to learn R by Flavio Barras.

From the post:

Inspired by this article i thought about gather here all multimedia resources that i know to learn use R. Today there is a lot of online courses, some MOOC’s too, that offer reasonable resources to start with R.

I will just list the materials in sequence and offer my evaluation about them. Of course your evaluation can be different; this case fell free to comment. In the future i can update the material. Let’s begin:

An annotated list of ten (10) courses for learning R.

Verification Handbook

Wednesday, March 26th, 2014

Verification Handbook: A definitive guide to verifying digital content for emergency coverage

From the website:

Authored by leading journalists from the BBC, Storyful, ABC, Digital First Media and other verification experts, the Verification Handbook is a groundbreaking new resource for journalists and aid providers. It provides the tools, techniques and step-by-step guidelines for how to deal with user-generated content (UGC) during emergencies.

What

When a crisis breaks, trusted sources such as news and aid organisations must sift through and verify the mass of reports being shared and published, and report back to the public with accurate, fact-checked information. The handbook provides actionable advice to facilitate disaster preparedness in newsrooms, and best practices for how to verify and use information, photos and videos provided by the crowd.

Who

While it primarily targets journalists and aid providers, the handbook can be used by anyone. It’s advice and guidance are valuable whether you are a news journalist, citizen reporter, relief responder, volunteer, journalism school student, emergency communication specialist, or an academic researching social media.

Interesting reading.

Now what we need is a handbook of common errors for reviewers.

I first saw this in Pete Warden’s Five short links, 18 March 2014.

Deep Belief in Javascript

Wednesday, March 26th, 2014

Deep Belief in Javascript

From the webpage:

It’s an implementation of the Krizhevsky convolutional neural network architecture for object recognition in images, running entirely in the browser using Javascript and WebGL!

I built it so people can easily experiment with a classic deep belief approach to image recognition themselves, to understand both its limitations and its power, and to demonstrate that the algorithms are usable even in very restricted client-side environments like web browsers.

A very impressive demonstration of the power of Javascript to say nothing of neural networks.

You can submit your own images for “recognition.”

I first saw this in Nat Torkington’s Four short links: 24 March 2014.

Torkington Conjecture (with Corollary)

Wednesday, March 26th, 2014

Torkington Conjecture: Systems that are hardest to attack are also the ones that are hardest for Normal People to use.

Corollary: Sufficiently stupid users are indistinguishable from intelligent attackers?

First published at Four short links: 20 March 2014 by Nat Torkington.

I have serious reservations about the Torkington Conjecture and its corollary.

The Torkington conjecture confuses the ease of use (by “Normal People”) with vulnerability to attack.

It isn’t hard to list systems that are relatively secure from attack and easy for “Normal People” to use.

My short list includes:

  • Nuclear weapon production facilities
  • Nuclear power plants
  • Police stations
  • Prisons
  • The White House

All of which are staffed by and used by “Normal People.”

The critical difference between those systems and digital systems? They were designed to be secure, or at least relatively so. No system is completely secure, so how secure a system must be is a requirements question to settle during design.

One possible counter-conjecture: Securing systems not designed to be secure makes them harder to use but not any more secure.

The corollary to the Torkington Conjecture: “Sufficiently stupid users are indistinguishable from intelligent attackers?” fails in the face of the Snowden experience. Snowden borrowed passwords from other sysadmins. Sysadmins who don’t number themselves among “sufficiently stupid users.”

Focusing security efforts on users is misguided. Users have not, do not and will not compensate for non-secure designs. Users, as we all are, are inconsistent, lazy, subject to having bad days and doing dumb things.

The common factor in all the relatively secure systems I mentioned above is that they were designed to compensate for user error.

Or to put it another way, security that depends on every user being perfect every day isn’t much in the way of security.

Big Data: Humans Required

Wednesday, March 26th, 2014

Big Data: Humans Required by Sherri Hammons.

From the post:

These simple examples outline the heart of the problem with data: interpretation. Data by itself is of little value. It is only when it is interpreted and understood that it begins to become information. GovTech recently wrote an article outlining why search engines will not likely replace actual people in the near future. If it were merely a question of pointing technology at the problem, we could all go home and wait for the Answer to Everything. But, data doesn’t happen that way. Data is very much like a computer: it will do just as it’s told. No more, no less. A human is required to really understand what data makes sense and what doesn’t. But, even then, there are many failed projects.

See Sherri’s post for a conversation overheard and a list of big data fallacies.

The same point has been made before but Sherri’s is a particularly good version of it.

Since it’s not news, at least to anyone who has been paying attention in the 20th and 21st centuries, the question becomes why do we keep making that same mistake over and over again?

That is, relying on computers for “the answer” rather than asking humans to set up the problem for a computer and to interpret the results.

Just guessing but I would say it has something to do with our wanting to avoid relying on other people. That in some manner, we are more independent, powerful, etc. if we can rely on machines instead of other people.

Here’s one example: Once upon a time if I wanted to hear Stabat Mater I would have to attend a church service and participate in its singing. In an age of iPods and similar devices, I can enjoy it in a cone of music that isolates me from my physical surroundings and the people around me.

Nothing wrong with recorded music, but the transition from a communal, participatory setting to being in a passive, self-chosen sound cocoon seems lossy to me.

Can we say the current fascination with “big data” and the exclusion of people is also lossy?

Yes?

I first saw this in Nat Torkington’s Four short links: 18 March 2014.

Vulnerabilities…

Wednesday, March 26th, 2014

Vulnerabilities – the world through the eyes of hackers by Or Weis.

The executive summary:

The Art of cyber-warfare has much in common with the art of war on the classic battlefield. To emerge victorious one must know oneself, the enemy and the battlefield.

Vulnerabilities are in the very essence of our reality and becoming even more fundamental in the world of cyber-security. Hackers or attackers see vulnerabilities all around them, knowing they are key to achieving their goals.

By understanding the key fundamentals of the attacker view, defenders can turn the tides of battle. Understanding the costs for mounting an attack, and the different stages of an attack, allow defenders to impose costs that can hinder or even thwart attacks from the get go; using principles like “The Great Wall” and “Weakest Link” detection.

Using a frequently updated ‘Common Operational Picture’ defenders can list their potential threats-understanding the likelihood, risk, and counter measures- enabling them to build and maintain powerful security profiles.

Doesn’t this echo The Art of War by Sun Tzu?

If so, then why isn’t The Art of War required reading in CS programs?

The paper itself has a militaristic/messianic tone to it so it makes for fun reading. You can imagine yourself resisting the forces of darkness, etc.

Whatever motivates you to work towards greater software and network security works for me.

I first saw this in Nat Torkington’s Four short links: 17 March 2014.

Is Parallel Programming Hard,…

Tuesday, March 25th, 2014

Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney.

From Chapter 1 How To Use This Book:

The purpose of this book is to help you understand how to program shared-memory parallel machines without risking your sanity.[1] By describing the algorithms and designs that have worked well in the past, we hope to help you avoid at least some of the pitfalls that have beset parallel-programming projects. But you should think of this book as a foundation on which to build, rather than as a completed cathedral. Your mission, if you choose to accept, is to help make further progress in the exciting field of parallel programming—progress that should in time render this book obsolete. Parallel programming is not as hard as some say, and we hope that this book makes your parallel-programming projects easier and more fun.

In short, where parallel programming once focused on science, research, and grand-challenge projects, it is quickly becoming an engineering discipline. We therefore examine the specific tasks required for parallel programming and describe how they may be most effectively handled. In some surprisingly common special cases, they can even be automated.

This book is written in the hope that presenting the engineering discipline underlying successful parallel-programming projects will free a new generation of parallel hackers from the need to slowly and painstakingly reinvent old wheels, enabling them to instead focus their energy and creativity on new frontiers. We sincerely hope that parallel programming brings you at least as much fun, excitement, and challenge that it has brought to us!

I should not have been surprised by:

16.4 Functional Programming for Parallelism

When I took my first-ever functional-programming class in the early 1980s, the professor asserted that the side-effect-free functional-programming style was well-suited to trivial parallelization and analysis. Thirty years later, this assertion remains, but mainstream production use of parallel functional languages is minimal, a state of affairs that might well stem from this professor’s additional assertion that programs should neither maintain state nor do I/O. There is niche use of functional languages such as Erlang, and multithreaded support has been added to several other functional languages, but mainstream production usage remains the province of procedural languages such as C, C++, Java, and FORTRAN (usually augmented with OpenMP or MPI).

The state of software vulnerability is testimony enough to the predominance of C, C++, and Java.

I’m not real sure I would characterize Erlang as a “niche” language. Niche languages aren’t often found running telecommunications networks, or at least that is my impression.

I would take McKenney’s comments as a challenge to use functional languages such as Clojure and Erlang to make in-roads into mainstream production.

While you use this work to learn the procedural approach to parallelism, you can be building contrasts to a functional one.
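
As a starting point for that kind of contrast, here is a small sketch of my own (not from McKenney's book) that computes the same total two ways: a procedural loop that updates a running counter, and a side-effect-free function mapped over the input by a worker pool. The function names and the word-count task are hypothetical. I use Python threads only to show the shape of the code, on the assumption that you run it under standard CPython, where threads will not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(line):
    """Pure function: the result depends only on the argument, no shared state."""
    return len(line.split())


def procedural_total(lines):
    """Procedural style: one loop mutating a running total."""
    total = 0
    for line in lines:
        total += len(line.split())
    return total


def parallel_total(lines, workers=4):
    """Functional style: map a pure function over the input, then reduce.

    Because word_count touches no shared state, the calls can run in any
    order on any worker without locks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(word_count, lines))
    return sum(counts)


if __name__ == "__main__":
    sample = ["is parallel programming hard", "and if so", "what can you do about it"]
    print(procedural_total(sample), parallel_total(sample))
```

The procedural version would need a lock or per-thread partial sums before its loop could be split across workers. The mapped version needs neither, which is the professor's point about trivial parallelization.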

I first saw this in Nat Torkington’s Four short links: 13 March 2014.

Network Analysis and the Law:…

Tuesday, March 25th, 2014

Network Analysis and the Law: Measuring the Legal Importance of Precedents at the U.S. Supreme Court by James H. Fowler, et al.

Abstract:

We construct the complete network of 26,681 majority opinions written by the U.S. Supreme Court and the cases that cite them from 1791 to 2005. We describe a method for using the patterns in citations within and across cases to create importance scores that identify the most legally relevant precedents in the network of Supreme Court law at any given point in time. Our measures are superior to existing network-based alternatives and, for example, offer information regarding case importance not evident in simple citation counts. We also demonstrate the validity of our measures by showing that they are strongly correlated with the future citation behavior of state courts, the U.S. Courts of Appeals, and the U.S. Supreme Court. In so doing, we show that network analysis is a viable way of measuring how central a case is to law at the Court and suggest that it can be used to measure other legal concepts.

Danny Bickson pointed this paper out in: Spotlight: Ravel Law – introducing graph analytics to law research.

Interesting paper but remember that models are just that, models. Subsets of a more complex reality.
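
To give a feel for what a network-based importance score looks like, here is a minimal sketch of hub and authority scoring (Kleinberg's HITS) on a toy citation network. I am swapping in this standard technique for illustration. It is not a reproduction of the authors' measures, and the case labels are made up.

```python
def hits(citations, iterations=50):
    """Compute hub and authority scores for a citation network.

    citations maps each citing case to the list of cases it cites.
    A case cited by good hubs gets a high authority score, and a case
    that cites good authorities gets a high hub score.
    """
    nodes = set(citations)
    for cited_list in citations.values():
        nodes.update(cited_list)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the citing cases.
        auth = {n: 0.0 for n in nodes}
        for citing, cited_list in citations.items():
            for cited in cited_list:
                auth[cited] += hub[citing]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        # Hub update: sum the authority scores of the cited cases.
        hub = {n: sum(auth[c] for c in citations.get(n, [])) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


if __name__ == "__main__":
    # Hypothetical cases: C, D and E all cite A; C also cites B; E also cites C.
    toy = {"C": ["A", "B"], "D": ["A"], "E": ["A", "C"]}
    hub, auth = hits(toy)
    for case in sorted(auth, key=auth.get, reverse=True):
        print(case, round(auth[case], 3))
```

Even on a toy network the scores depend on who is doing the citing, not just on how many citations a case collects, which is what a simple citation count cannot show.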

For example, I don’t know of any models of the Supreme Court (U.S.) that claim to be able to predict The switch in time that saved nine. If you don’t know the story, it makes really interesting reading. I won’t spoil the surprise but you will come away feeling the law is less “fixed” than you may have otherwise thought.

I commend this paper to you but if you are in need of legal advice, it’s best to consult an attorney and not a model.

USGS Maps!

Tuesday, March 25th, 2014

USGS Maps (Google Map Gallery)

Wicked cool!

Followed a link from this post:

Maps were made for public consumption, not for safekeeping under lock and key. From the dawn of society, people have used maps to learn what’s around us, where we are and where we can go.

Since 1879, the U.S. Geological Survey (USGS) has been dedicated to providing reliable scientific information to better understand the Earth and its ecosystems. Mapping is an integral part of what we do. From the early days of mapping on foot in the field to more modern methods of satellite photography and GPS receivers, our scientists have created over 193,000 maps to understand and document changes to our environment.

Government agencies and NGOs have long used our maps everything from community planning to finding hiking trails. Farmers depend on our digital elevation data to help them produce our food. Historians look to our maps from years past to see how the terrain and built environment have changed over time.

While specific groups use USGS as a resource, we want the public at-large to find and use our maps, as well. The content of our maps—the information they convey about our land and its heritage—belongs to all Americans. Our maps are intended to serve as a public good. The more taxpayers use our maps and the more use they can find in the maps, the better.

We recognize that our expertise lies in mapping, so partnering with Google, which has expertise in Web design and delivery, is a natural fit. Google Maps Gallery helps us organize and showcase our maps in an efficient, mobile-friendly interface that’s easy for anyone to find what they’re looking for. Maps Gallery not only publishes USGS maps in high-quality detail, but makes it easy for anyone to search for and discover new maps.

My favorite line:

Maps were made for public consumption, not for safekeeping under lock and key.

Very true. Equally true for all the research and data that is produced at the behest of the government.