12 Things TEDx Speakers do that Preachers Don’t.

April 19th, 2014

12 Things TEDx Speakers do that Preachers Don’t.

From the post:

Ever seen a TEDx talk? They’re pretty great. Here’s one I happen to enjoy, and have used in a couple of sermons. I’ve wondered for a long time, “How in the world do each of these talks end up consistently blowing me away?” So I did some research, and found the TEDx talk guidelines for speakers. Some of the advice was basic – but some of it was unexpected. Much of it, I think, is a welcome wake up call to preachers who are communicating in a 21st century postmodern, post-Christian context. Obviously, some of this doesn’t fit with a preacher’s ethos: but much of it does.

That said, here are 12 things TEDx speakers do that preachers usually don’t:

A great retelling of the guidelines for TEDx speakers!

With the conference season (summer) rapidly approaching, now is the time to take this advice to heart!

Imagine a conference presentation without the filler that everyone in the room already knows (or should, to be attending the conference). I keep longing for papers that don't repeat largely the same introduction as every other paper in the area.

Yes, graphs have nodes/vertices, edges/arcs and you are g-o-i-n-g t-o l-a-b-e-l t-h-e-m. ;-)

The advice for TEDx speakers is equally applicable to webcasts and podcasts.

New trends in sharing data science work

April 19th, 2014

New trends in sharing data science work

Danny Bickson writes:

I got the following venturebeat article from my colleague Carlos Guestrin.

It seems there is an interesting trend of allowing data scientists to share their work: Imagine if a company’s three highly valued data scientists can happily work together without duplicating each other’s efforts and can easily call up the ingredients and results of each other’s previous work.

That day has come. As the data scientist arms race continues, data scientists might want to join forces. Crazy idea, right? Two San Francisco startups — Domino Data Lab and Sense — have emerged recently with software to let data scientists collaborate on multiple projects. In a way, it’s like code storehouse GitHub for the data science world. A Montreal startup named Plot.ly has been talking about the same themes, but it brings a more social twist. Another startup, Mode Analytics, is building software for data analysts to ask questions of data without duplicating previous efforts. And at least one more mature software vendor, Alpine Data Labs, has been adding features to help many colleagues in a company apply algorithms to code on one central hub.

If you aren’t already registered for GraphLab Conference 2014, note that Alpine Data Labs, Domino Data Lab, Mode Analytics, Plot.ly, and Sense will all be at the GraphLab Conference.

Go ahead, register for the GraphLab conference. At the very worst you will learn something. If you socialize a little bit, you will meet some of the brightest graph people on the planet.

Plus, when the history of “sharing” in data science is written, you will have attended one of the early conferences on sharing code for data science. After years of hoarding data (now giving way to open data) and only beginning to share code, data science is developing a different model.

And you were there to cheer them on!

Apache Lucene/Solr 4.7.2

April 19th, 2014

Apache Lucene 4.7.2

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene Changes.txt

Fixes potential index corruption, LUCENE-5574.

Apache Solr 4.7.2

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr Changes.txt

Given the potential for index corruption, I would not treat this as an optional upgrade.

GraphChi Users Survey

April 19th, 2014

GraphChi Users Survey

From the form:

This survey is used to find out about experiences of users of GraphChi. These results will be used in Aapo Kyrola’s Ph.D. thesis.

If you are using GraphChi, your experiences can help with Aapo Kyrola’s Ph.D. thesis.

Pass this along to anyone you know using GraphChi (and try GraphChi yourself).

Visual Programming Languages – Snapshots

April 19th, 2014

Visual Programming Languages – Snapshots by Eric Hosick.

If you are interested in symbolic topic map authoring or symbolic authoring for other purposes, this is a must-see site for you!

Eric has collected (as of today) one-hundred and forty (140) examples of visual programming languages.

I am sure there are some visualization techniques that were not used in these examples but offhand, I can’t say which ones. ;-)

Definitely a starting point for any new visual interfaces.

Streamtools – Update

April 19th, 2014

streamtools 0.2.4

From the webpage:

This release contains:

  • toEmail and fromEmail blocks: use streamtools to receive and create emails!
  • linear modelling blocks: use streamtools to perform linear and logistic regression using stochastic gradient descent.
  • GUI updates : a new block reference/creation panel.
  • a kullback leibler block for comparing distributions.
  • added a tutorials section to streamtools available at /tutorials in your streamtools server.
  • many small bug fixes and tweaks.

See also: Introducing Streamtools.

+1 on news input becoming more stream-like. But streams, of water and news, can both become polluted.

Filtering water is a well-known science.

Filtering information is doable but with less certain results.

How do you filter your input? (Not necessarily automatically, algorithmically, etc. You have to define the filter first, then choose the means to implement it.)
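For what it’s worth, here is a minimal sketch of what I mean by defining the filter before choosing the means to implement it. This is plain Python, not a streamtools block; the keyword list, the noise pattern, and the sample items are all placeholders you would replace with your own.

    import re

    # A toy stream filter: drop items that match a noise pattern, keep items
    # that mention at least one topic of interest. The word lists and the
    # rule itself are stand-ins for whatever filter you actually define.
    KEEP = {"graph", "semantics", "topic map"}
    DROP = re.compile(r"sponsored|advertisement", re.IGNORECASE)

    def keep_item(text):
        if DROP.search(text):
            return False
        lowered = text.lower()
        return any(term in lowered for term in KEEP)

    stream = [
        "New graph database release announced",
        "Sponsored: buy one get one free",
        "Celebrity gossip roundup",
    ]
    print([item for item in stream if keep_item(item)])

The point is less the code than the fact that the filter is explicit: you can read it, argue with it, and version it.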

I first saw this in a tweet by Michael Dewar.

Algorithmic Newspapers?

April 19th, 2014

A print newspaper generated by robots: Is this the future of media or just a sideshow? by Mathew Ingram.

From the post:

What if you could pick up a printed newspaper, but instead of a handful of stories hand-picked by a secret cabal of senior editors in a dingy newsroom somewhere, it had pieces that were selected based on what was being shared — either by your social network or by users of Facebook, Twitter etc. as a whole? Would you read it? More importantly, would you pay for it?

You can’t buy one of those yet, but The Guardian (see disclosure below) is bringing an experimental print version it has been working on to the United States for the first time: a printed paper that is generated entirely — or almost entirely — by algorithms based on social-sharing activity and other user behavior by the paper’s readers. Is this a glimpse into the future of newspapers?

According to Digiday, the Guardian‘s offering — known as #Open001 — is being rolled out later this week. But you won’t be able to pick one up at the corner store: only 5,000 copies will be printed each month, and they are going to the offices of media and ad agencies. In other words, it’s as much a marketing effort at this point for the Guardian (which isn’t printed in the U.S.) as it is a publishing experiment.

Mathew recounts the Guardian effort and similar services, and questions whether robots can preserve the serendipity allegedly introduced by editors. It’s a good read.

The editors at the Guardian may introduce stories simply because they are “important,” but is that the case for other media outlets?

I know that is often alleged, but peer review was also alleged to lead to funding good projects and ensuring that good papers were published. The alleged virtues of peer review, when tested, have been found to be false.

How would you test for “serendipity” in a news outlet? That is, not simply running stories because they are popular in the local market, but because they are “important”?

Or to put it another way: Is the news from a local media outlet already being personalized/customized?

Tools for Reproducible Research [Reproducible Mappings]

April 19th, 2014

Tools for Reproducible Research by Karl Broman.

From the post:

A minimal standard for data analysis and other scientific computations is that they be reproducible: that the code and data are assembled in a way so that another group can re-create all of the results (e.g., the figures in a paper). The importance of such reproducibility is now widely recognized, but it is still not so widely practiced as it should be, in large part because many computational scientists (and particularly statisticians) have not fully adopted the required tools for reproducible research.

In this course, we will discuss general principles for reproducible research but will focus primarily on the use of relevant tools (particularly make, git, and knitr), with the goal that the students leave the course ready and willing to ensure that all aspects of their computational research (software, data analyses, papers, presentations, posters) are reproducible.

As you already know, there is a great deal of interest in making scientific experiments reproducible in fact as well as in theory.

At the same time, there has been increasing interest in reproducible data analysis as it concerns the results from reproducible experiments.

One logically follows on from the other.

Of course, reproducible data analysis, so far as it combines data from different sources, would simply follow, cookie-cutter fashion, the combination of data in the reported experiment.

But what if a user wants to replicate the combining (mapping) of data with other data, from different sources? The steps could be followed by rote, but others would not know the underlying basis for the choices made in the mapping.

Experimenters take great care to identify the substances used in an experiment. When data is combined from different sources, why not do the same for the data?

I first saw this in a tweet by Yihui Xie.

On InChI and evaluating the quality of cross-reference links

April 19th, 2014

On InChI and evaluating the quality of cross-reference links by Jakub Galgonek and Jiří Vondrášek. (Journal of Cheminformatics 2014, 6:15 doi:10.1186/1758-2946-6-15)

Abstract:

Background

There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.

Results

We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (.85% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.

Conclusions

We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and the inherent limitations of the InChI method.

Another use case for topic maps, don’t you think?

Rather than a mapping keyed on recognition of a single identifier, have the mapping keyed to the recognition of several key/value pairs.

I don’t think there is an abstract answer as to the optimum number of key/value pairs that must match for identification. Experience would be a much better guide.
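To make that concrete, here is a back-of-the-envelope sketch of keying identification on several key/value pairs rather than a single identifier. It is plain Python; the property names, the matching threshold, and the methane records are mine, purely for illustration.

    # Treat two records as the same compound when at least `required` of the
    # chosen keys are present in both records and agree.
    def same_compound(a, b, keys=("inchi", "formula", "cas_number"), required=2):
        matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
        return matches >= required

    rec1 = {"inchi": "InChI=1S/CH4/h1H4", "formula": "CH4"}
    rec2 = {"inchi": "InChI=1S/CH4/h1H4", "formula": "CH4", "cas_number": "74-82-8"}
    print(same_compound(rec1, rec2))  # True

As the paper suggests, the right keys and threshold are an empirical question, not an abstract one.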

ROCK, RACK And Knowledge Retention

April 18th, 2014

Roundtable on Knowledge Retention Techniques held on 21 May 2013.

From the post:

Back in May 2013, we held a Roundtable on Knowledge Retention Techniques. Carla Newman, Shaharudin Mohd Ishak and Ashad Ahmed very graciously shared with us their journey and experiences in Knowledge Retention.

Three videos: Carla Newman on ROCK (Retention of Critical Knowledge), Shaharudin Mohd Ishak on IE Singapore’s RACK (Retention of All Critical Knowledge), and Ashad Ahmed on Knowledge Retention.

Any knowledge problem of interest to Shell Oil Company is of interest to me! ;-)

At what junctures in a knowledge retention process would topic maps have the greatest impact?

Not really interested in disrupting current approaches or processes but in discovering where topic maps could be a value add to existing systems.

Saving Output of nltk Text.Concordance()

April 18th, 2014

Saving Output of NLTK Text.Concordance() by Kok Hua.

From the post:

In NLP, users sometimes would like to search for the series of phrases that contain a particular keyword in a passage or web page.

NLTK provides the function concordance() to locate and print the series of phrases that contain the keyword. However, the function only prints the output. The user is not able to save the results for further processing unless they redirect stdout.

The function below emulates the concordance function and returns the list of phrases for further processing. It uses the NLTK ConcordanceIndex, which keeps track of the keyword’s positions in the passage/text, to retrieve the surrounding words.
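Here is a rough sketch along the lines Kok Hua describes (not his code), using NLTK’s ConcordanceIndex; the width parameter, the lowercasing, and the toy sentence are my assumptions.

    from nltk.text import ConcordanceIndex

    def concordance_phrases(tokens, keyword, width=5):
        """Return the phrases surrounding each occurrence of keyword,
        instead of printing them as Text.concordance() does."""
        index = ConcordanceIndex(tokens, key=lambda token: token.lower())
        phrases = []
        for offset in index.offsets(keyword.lower()):
            left = tokens[max(0, offset - width):offset]
            right = tokens[offset + 1:offset + 1 + width]
            phrases.append(" ".join(left + [tokens[offset]] + right))
        return phrases

    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(concordance_phrases(tokens, "the"))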

Text mining is a very common part of topic map construction so tools that help with that task are always welcome.

To be honest, I am citing this because it may become part of several small tools for processing standards drafts. Concordance software is not rare, but a full concordance of a document seems to frighten some proofreaders.

The current thinking is that if only the “important” terms are highlighted in context, some proofreaders will be more likely to use the work product.

The same principle applies to the authoring of topic maps as well.

Learn Clojure…

April 18th, 2014

Learn Clojure – Clojure Koans Walkthrough in Light Table IDE

You have heard of Clojure and no doubt the Clojure Koans.

Now there are videos solving the Clojure Koans using the Light Table IDE.

I first saw this at Clojure Koans by Christopher Bare.

Marketing Strategy for Topic Maps?

April 18th, 2014

Should you reveal a P = NP algorithm? by Lance Fortnow.

I think Lance captures the best topic map marketing strategy when he says:

Once again though no one will take you seriously unless you really have a working algorithm. If you just tell Google you have an algorithm for NP-complete problem they will just ignore you. If you hand them their private keys then they will listen. (emphasis added)

I can think of endless areas that I think would benefit from topic maps, but where foresight fails is in choosing an area and degree of granularity that would interest others.

A topic map about Justin Bieber would be useful in some sense, but there are sites that organize that data already. What degree of “better” would be required for data about a sulky Canadian to be successful?

Or take OMB data for example. You may remember my post: Free Sequester Data Here! where I list a series of posts on converting OMB data into machine-readable files and the issues with that data.

Apparently I was the only person in the United States who did not realize OMB reports are stylized fiction meant to serve political ends. The report in question had no connection to reality other than mostly getting department and program names correct.

Putting the question to you: What would make a good (doesn’t have to be great) demonstration of topic maps? One that would resonate with potential customers? BTW, you need to say why it would be a good demonstration of topic maps. In what respects are other resources in the area deficient and why would that deficiency drive someone to seek out topic maps?

I know that is a very broad question but it is an important one to ask and hopefully, to work towards a useful answer.

A/B Tests and Facebook

April 18th, 2014

The reality is most A/B tests fail, and Facebook is here to help by Kaiser Fung.

From the post:

Two years ago, Wired breathlessly extolled the virtues of A/B testing (link). A lot of Web companies are in the forefront of running hundreds or thousands of tests daily. The reality is that most A/B tests fail.

A/B tests fail for many reasons. Typically, business leaders consider a test to have failed when the analysis fails to support their hypothesis. “We ran all these tests varying the color of the buttons, and nothing significant ever surfaced, and it was all a waste of time!” For smaller websites, it may take weeks or even months to collect enough samples to read a test, and so business managers are understandably upset when no action can be taken at its conclusion. It feels like waiting for the train which is running behind schedule.

Bad outcome isn’t the primary reason for A/B test failure. The main ways in which A/B tests fail are:

  1. Bad design (or no design);
  2. Bad execution;
  3. Bad measurement.

These issues are often ignored or dismissed. They may not even be noticed if the engineers running the tests have not taken a proper design of experiments class. However, even though I earned an A at school, it wasn’t until I started running real-world experiments that I really learned the subject. This is an area in which theory and practice are both necessary.

The Facebook Data Science team just launched an open platform for running online experiments, called PlanOut. This looks like a helpful tool to avoid design and execution problems. I highly recommend looking into how to integrate it with your website. An overview is here, and a more technical paper (PDF) is also available. There is a github page.

The rest of this post gets into some technical, sausage-factory stuff, so be warned.
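As an aside (this is not from Kaiser’s post and it is not PlanOut’s API), the core of most assignment schemes is a deterministic hash, so the same user always lands in the same variant and assignments can be audited later. A minimal sketch:

    import hashlib

    # Hash the experiment name and user id together; the same pair always
    # maps to the same bucket, so assignment is stable and reproducible.
    def assign_variant(experiment, user_id, variants=("A", "B")):
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    print(assign_variant("button_color", "user-42"))

Design, execution and measurement still have to be right; stable assignment just keeps you from fooling yourself about who saw what.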

For all of your software tests, do you run any A/B tests on your interface?

Or is your response to UI criticism, “…well, but all of us like it.” That’s a great test for a UI.

If you don’t read any other blog post this weekend, read Kaiser’s take on A/B testing.

VOWL: Visual Notation for OWL Ontologies

April 18th, 2014

VOWL: Visual Notation for OWL Ontologies

Abstract:

The Visual Notation for OWL Ontologies (VOWL) defines a visual language for the user-oriented representation of ontologies. It provides graphical depictions for elements of the Web Ontology Language (OWL) that are combined to a force-directed graph layout visualizing the ontology.

This specification focuses on the visualization of the ontology schema (i.e. the classes, properties and datatypes, sometimes called TBox), while it also includes recommendations on how to depict individuals and data values (the ABox). Familiarity with OWL and other Semantic Web technologies is required to understand this specification.

At the end of the specification there is an interesting example, but as a “force-directed graph layout” it illustrates one of the difficulties I have with that approach.

I have this unreasonable notion that a node I select and place in the display should stay where I have placed it, not shift about because I have moved some other node. Quite annoying and I don’t find it helpful at all.

I first saw this at: VOWL: Visual Notation for OWL Ontologies

Non-Painful Presentations

April 18th, 2014

Looking to give fewer painful presentations?

Want advice to forward on non-painful presenting?

If you answered “yes,” to either of those questions, read: This Advice From IDEO’s Nicole Kahn Will Transform the Way You Give Presentations.

Nicole Kahn has short and engaging advice that boils down to three (3) touchstones for making a non-painful and perhaps even compelling presentation.

It’s easier to sell an idea, technology or product if the audience isn’t in pain after your presentation.

Precision from Disaggregation

April 18th, 2014

Building Precise Maps with Disser by Brandon Martin-Anderson.

From the post:

Spatially aggregated statistics are pretty great, but what if you want more precision? Here at Conveyal we built a utility to help with that: aggregate-disser. Let me tell you how it works.

Let’s start with a classic aggregated data set – the block-level population counts from the US Census. Here’s a choropleth map of total population for blocks around lower Manhattan and Brooklyn. The darkest shapes contain about five thousand people.

Brandon combines census data with other data sets to go from 5,000-person census blocks to locating every job and bed in Manhattan in individual buildings.
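I have not read aggregate-disser’s internals, so this is not how it works, but the general flavor of disaggregation is easy to sketch: spread an aggregate count over candidate sites in proportion to some weight. The building names and floor areas below are invented.

    import random
    from collections import Counter

    # Allocate an aggregate count to individual sites, with the probability
    # of each site proportional to its weight (e.g. floor area).
    def disaggregate(total_count, site_weights, seed=0):
        random.seed(seed)
        sites = list(site_weights)
        weights = [site_weights[site] for site in sites]
        draws = random.choices(sites, weights=weights, k=total_count)
        return Counter(draws)

    block_population = 5000
    buildings = {"bldg_a": 12000.0, "bldg_b": 3000.0, "bldg_c": 5000.0}
    print(disaggregate(block_population, buildings))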

Very cool!

Not to mention instructive when you encounter group subjects that need to be disaggregated before being combined with other data.

I first saw this in a tweet by The O.C.R.

Announcing Schema.org Actions

April 17th, 2014

Announcing Schema.org Actions

From the post:

When we launched schema.org almost 3 years ago, our main focus was on providing vocabularies for describing entities — people, places, movies, restaurants, … But the Web is not just about static descriptions of entities. It is about taking action on these entities — from making a reservation to watching a movie to commenting on a post.

Today, we are excited to start the next chapter of schema.org and structured data on the Web by introducing vocabulary that enables websites to describe the actions they enable and how these actions can be invoked.

The new actions vocabulary is the result of over two years of intense collaboration and debate amongst the schema.org partners and the larger Web community. Many thanks to all those who participated in these discussions, in particular to members of the Web Schemas and Hydra groups at W3C. We are hopeful that these additions to schema.org will help unleash new categories of applications.

From http://schema.org/Action:

Thing > Action

An action performed by a direct agent and indirect participants upon a direct object. Optionally happens at a location with the help of an inanimate instrument. The execution of the action may produce a result. Specific action sub-type documentation specifies the exact expectation of each argument/role.

Fairly coarse but I can see how it would be useful.

BTW, the examples are only available in JSON-LD. Just in case you were wondering.

Given the coarseness of schema.org and its success, due consideration should be given to the “appropriate” coarseness of semantics for any particular task.

…lotteries to pick NIH research-grant recipients

April 17th, 2014

Wall Street Journal op-ed advocates lotteries to pick NIH research-grant recipients by Steven T. Corneliussen

From the post:

The subhead for the Wall Street Journal op-ed “Taking the Powerball approach to funding medical research” summarizes its coauthors’ argument about research funding at the National Institutes of Health (NIH): “Winning a government grant is already a crapshoot. Making it official by running a lottery would be an improvement.”

The coauthors, Ferric C. Fang and Arturo Casadevall, serve respectively as a professor of laboratory medicine and microbiology at the University of Washington School of Medicine and as professor and chairman of microbiology and immunology at the Albert Einstein College of Medicine of Yeshiva University.

At a time when funding levels are historically low, they note, grant peer review remains expensive. The NIH Center for Scientific Review has a $110 million annual budget. Grant-submission and grant-review processes extract an additional high toll from participants. Within this context, the coauthors summarize criticisms of NIH peer review. They mention a 2012 Nature commentary that argued, they say, that the system’s structure “encourages conformity.” In particular, after mentioning a study in the journal Circulation Research, they propose that concerning projects judged good enough for funding, “NIH peer reviewers fare no better than random chance when it comes to predicting how well grant recipients will perform.”

Nature should use a “mock” lottery to judge the acceptance of papers alongside its normal peer review process. Publish the results after a year of peer review “competing” with a lottery.

Care to speculate on the results as evaluated by Nature readers?

Tutorial Statistical Graph Analysis

April 17th, 2014

Tutorial Statistical Graph Analysis by Aapo Kyrola.

From the post:

GraphChi-DB can be used for efficiently querying induced subgraphs from very large networks. Thus, you can, for example, easily sample a vertex and retrieve the induced neighborhood graph of that vertex. Or you can choose a random set of vertices and compute their induced subgraph.

Assuming you have the data loaded in the database, and the GraphChiDatabase object in a value “DB”, here is how you request edges for the induced subgraph of a set of vertices:
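The excerpt stops before the code, and I won’t guess at GraphChi-DB’s Scala API, but the induced subgraph idea itself fits in a few lines of plain Python: keep exactly the edges whose endpoints are both in the chosen vertex set.

    edges = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 5)]
    vertices = {1, 2, 3}

    # The induced subgraph is the vertex set plus every edge with both ends in it.
    induced_edges = [(u, v) for (u, v) in edges if u in vertices and v in vertices]
    print(induced_edges)  # [(1, 2), (2, 3)]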

GraphChi-DB was released just two days ago (GraphChi-DB [src released]) and today you have a tutorial written for it.

Not bad, not bad at all.

Clojure Procedural Dungeons

April 17th, 2014

Clojure Procedural Dungeons

From the webpage:

When making games, there are two ways to make a dungeon. The common method is to design one in the CAD tool of our choice (or to draw one in case of 2D games).

The alternative is to automatically generate random Dungeons by using a few very powerful algorithms. We could automatically generate a whole game world if we wanted to, but let’s take one step after another.

In this Tutorial we will implement procedural Dungeons in Clojure, while keeping everything as simple as possible so everyone can understand it.
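The tutorial’s code is in Clojure; as a language-neutral taste of the “powerful algorithms” being talked about, here is a drunkard’s-walk dungeon carver in Python. Grid size, step count, and seed are arbitrary choices of mine.

    import random

    # Start from a solid grid of walls and carve floor tiles with a random walk.
    def carve_dungeon(width=20, height=10, steps=120, seed=42):
        random.seed(seed)
        grid = [["#"] * width for _ in range(height)]
        x, y = width // 2, height // 2
        for _ in range(steps):
            grid[y][x] = "."
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x = min(max(x + dx, 1), width - 2)   # keep a wall border
            y = min(max(y + dy, 1), height - 2)
        return "\n".join("".join(row) for row in grid)

    print(carve_dungeon())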

Just in case you are interested in a gaming approach for a topic maps interface.

Not as crazy as that may sound. One of the brightest CS types I ever knew spent a year playing a version of Myst from start to finish.

Think about app sales if you can make your interface addictive.

Suggestion: Populate your topic map authoring interface with trolls (accounting), smiths (manufacturing), cavalry (shipping), royalty (management), wizards (IT), etc. and make the collection of information about their information into tokens, spells, etc. Sprinkle in user preference activities and companions.

That would be a lot of work but I suspect you would get volunteers to create new levels as your information resources evolve.

Cloudera Live (beta)

April 17th, 2014

Cloudera Live (beta)

From the webpage:

Try a live demo of Hadoop, right now.

Cloudera Live is a new way to get started with Apache Hadoop, online. No downloads, no installations, no waiting. Watch tutorial videos and work with real-world examples of the complete Hadoop stack included with CDH, Cloudera’s completely open source Hadoop platform, to:

  • Learn Hue, the Hadoop User Interface developed by Cloudera
  • Query data using popular projects like Apache Hive, Apache Pig, Impala, Apache Solr, and Apache Spark (new!)
  • Develop workflows using Apache Oozie

Great news for people interested in Hadoop!

Question: Will this become the default delivery model for test driving software and training?

Enjoy!

A Gentle Introduction to Scikit-Learn…

April 17th, 2014

A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library by Jason Brownlee.

From the post:

If you are a Python programmer or you are looking for a robust library you can use to bring machine learning into a production system then a library that you will want to seriously consider is scikit-learn.

In this post you will get an overview of the scikit-learn library and useful references of where you can learn more.

Nothing new if you are already using Scikit-Learn, but a very nice introduction with additional resources to pass on to others.
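If you want something runnable to paste alongside the introduction, a minimal example against the current scikit-learn API (not taken from Brownlee’s post) looks like this:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Fit a classifier on the built-in iris data and score it on held-out rows.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))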

Save yourself some time in gathering materials to spread the use of Scikit-Learn. Bookmark and forward today!

Reproducible Research/(Mapping?)

April 17th, 2014

Implementing Reproducible Research edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng.

From the webpage:

In many of today’s research fields, including biomedicine, computational tools are increasingly being used so that the results can be reproduced. Researchers are now encouraged to incorporate software, data, and code in their academic papers so that others can replicate their research results. Edited by three pioneers in this emerging area, this book is the first one to explore this groundbreaking topic. It presents various computational tools useful in research, including cloud computing, data repositories, virtual machines, R’s Sweave function, XML-based programming, and more. It also discusses legal issues and case studies.

There is a growing concern over the ability of scientists to reproduce the published results of other scientists. The Economist rang one of many alarm bells when it published: Trouble at the lab [Data Skepticism].

From the introduction to Reproducible Research:

Literate statistical programming is a concept introduced by Rossini () that builds on the idea of literate programming as described by Donald Knuth. With literate statistical programming, one combines the description of a statistical analysis and the code for doing the statistical analysis into a single document. Subsequently, one can take the combined document and produce either a human-readable document (i.e. PDF) or a machine readable code file. An early implementation of this concept was the Sweave system of Leisch which uses R as its programming language and LaTeX as its documentation language (). Yihui Xie describes his knitr package which builds substantially on Sweave and incorporates many new ideas developed since the initial development of Sweave. Along these lines, Tanu Malik and colleagues describe the Science Object Linking and Embedding framework for creating interactive publications that allow authors to embed various aspects of computational research in document, creating a complete research compendium.

Of course, we all cringe when we read that a drug company can reproduce only 1/4 of 67 “seminal” studies.

What has me curious is why we don’t have the same reaction when enterprise IT systems require episodic remapping, which requires the mappers to relearn what was known at the time of the last remapping. We all know that enterprise (and other) IT systems change and evolve, but practically speaking, no effort is made to capture the knowledge that would reduce the time and expense of every future remapping.

We can see the expense and danger of science not being reproducible, but when our own enterprise data mappings are not reproducible, that’s just the way things are.

Take inspiration from the movement towards reproducible science and work towards reproducible semantic mappings.

I first saw this in a tweet by Victoria Stodden.

Expert vs. Volunteer Semantics

April 17th, 2014

The variability of crater identification among expert and community crater analysts by Stuart J. Robbins, et al.

Abstract:

The identification of impact craters on planetary surfaces provides important information about their geological history. Most studies have relied on individual analysts who map and identify craters and interpret crater statistics. However, little work has been done to determine how the counts vary as a function of technique, terrain, or between researchers. Furthermore, several novel internet-based projects ask volunteers with little to no training to identify craters, and it was unclear how their results compare against the typical professional researcher. To better understand the variation among experts and to compare with volunteers, eight professional researchers have identified impact features in two separate regions of the moon. Small craters (diameters ranging from 10 m to 500 m) were measured on a lunar mare region and larger craters (100s m to a few km in diameter) were measured on both lunar highlands and maria. Volunteer data were collected for the small craters on the mare. Our comparison shows that the level of agreement among experts depends on crater diameter, number of craters per diameter bin, and terrain type, with differences of up to ∼±45. We also found artifacts near the minimum crater diameter that was studied. These results indicate that caution must be used in most cases when interpreting small variations in crater size-frequency distributions and for craters ≤10 pixels across. Because of the natural variability found, projects that emphasize many people identifying craters on the same area and using a consensus result are likely to yield the most consistent and robust information.

The identification of craters on the Moon may seem far removed from your topic map authoring concerns but I would suggest otherwise.

True, the paper is domain-specific in some of its concerns (crater age, degradation, etc.), but the most important question was whether volunteers in aggregate could be as useful as experts in the identification of craters.

The authors conclude:

Except near the minimum diameter, volunteers are able to identify craters just as well as the experts (on average) when using the same interface (the Moon Mappers interface), resulting in not only a similar number of craters, but also a similar size distribution. (page 34)

I find that suggestive for mapping semantics because, unlike moon craters, what words mean (and implicitly why) is a daily concern for users, including the ones in your enterprise.

You can, of course, employ experts to re-interpret what some of your users have told them into the experts’ language, and produce semantic integration based on the experts’ understanding or misunderstanding of your domain.

Or you can use your own staff, with experts facilitating the encoding of their understanding of your enterprise semantics, as in a topic map.

Recall that the semantics of your enterprise aren’t “out there” in the ether but reside within the staff who make up your enterprise.

I still see an important role for experts, but it isn’t as the source of your semantics; rather, it is as hunters who assist in capturing your semantics.

I first saw this in a tweet by astrobites that led me to: Crowd-Sourcing Crater Identification by Brett Deaton.

GraphLab Create: Upgrade

April 17th, 2014

GraphLab Create: Upgrade

From the webpage:

The latest version of graphlab-create is 0.2 beta. See what’s new for information about new features and the release notes for detailed additions and compatibility changes.

From the what’s new page:

GraphLab data structures: SFrame (a scalable tabular data structure) and Graph

Machine learning toolkits: recommender functionality and general machine learning

Cloud:

  • Support for all AWS regions
  • Secured client server communication using strong, standards-based encryption
  • CIDR rule specification for Amazon EC2 instance identification

For detailed information about additional features and compatibility changes, see the release notes.

For known issues and feature requests visit the forum!

Cool! Be sure to pass this news along!

Future Values of Merges

April 17th, 2014

Multilisp: A Language for Concurrent Symbolic Computation by Robert H. Halstead. (ACM Transactions on Programming Languages and Systems, Vol. 7, No. 4, October 1985, Pages 501-538.)

Abstract:

Multilisp is a version of the Lisp dialect Scheme extended with constructs for parallel execution. Like Scheme, Multilisp is oriented toward symbolic computation. Unlike some parallel programming languages, Multilisp incorporates constructs for causing side effects and for explicitly introducing parallelism. The potential complexity of dealing with side effects in a parallel context is mitigated by the nature of the parallelism constructs and by support for abstract data types: a recommended Multilisp programming style is presented which, if followed, should lead to highly parallel, easily understandable programs. Multilisp is being implemented on the 32-processor Concert multiprocessor; however, it is ultimately intended for use on larger multiprocessors. The current implementation, called Concert Multilisp, is complete enough to run the Multilisp compiler itself and has been run on Concert prototypes including up to eight processors. Concert Multilisp uses novel techniques for task scheduling and garbage collection. The task scheduler helps control excessive resource utilization by means of an unfair scheduling policy; the garbage collector uses a multiprocessor algorithm based on the incremental garbage collector of Baker.

Of particular interest:

Multilisp’s principal construct for both creating tasks and synchronizing among them is the future. The construct ( future X ) immediately returns a future for the value of the expression X and concurrently begins evaluating X. When the evaluation of X yields a value, that value replaces the future. The future is said to be initially undetermined; it becomes determined when its value has been computed. An operation (such as addition) that needs to know the value of an undetermined future will be suspended until the future becomes determined, but many operations, such as assignment and parameter passing, do not need to know anything about the values of their operands and may be performed quite comfortably on undetermined futures. The use of futures often exposes surprisingly large amounts of parallelism in a program, as illustrated by a Quicksort program given in Figure 1.

The use of a future value for merge operations, which could avoid re-processing a topic map for each cycle of merging, sounds promising.
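For a feel of the construct outside Lisp, here is the same idea using Python’s concurrent.futures rather than Multilisp; the merge function is a made-up stand-in for a topic map merge step.

    from concurrent.futures import ThreadPoolExecutor

    def merge(left, right):
        # Stand-in for an expensive merge operation.
        return sorted(left + right)

    with ThreadPoolExecutor() as pool:
        f1 = pool.submit(merge, [3, 1], [2])   # an undetermined future
        f2 = pool.submit(merge, [9, 7], [8])   # evaluated concurrently
        # f1 and f2 can be stored or passed along without blocking; only
        # asking for their values (.result()) waits until they are determined.
        print(merge(f1.result(), f2.result()))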

Deferral of the results isn’t just an old Lisp idea as you will find in: Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner, and Sebastian Stolzenberg.

Abstract:

Counting problems, determining the number of possible states of a large system under certain constraints, play an important role in many areas of science. They naturally arise for complex disordered systems in physics and chemistry, in mathematical graph theory, and in computer science. Counting problems, however, are among the hardest problems to access computationally. Here, we suggest a novel method to access a benchmark counting problem, finding chromatic polynomials of graphs. We develop a vertex-oriented symbolic pattern matching algorithm that exploits the equivalence between the chromatic polynomial and the zero-temperature partition function of the Potts antiferromagnet on the same graph. Implementing this bottom-up algorithm using appropriate computer algebra, the new method outperforms standard top-down methods by several orders of magnitude, already for moderately sized graphs. As a first application, we compute chromatic polynomials of samples of the simple cubic lattice, for the first time computationally accessing three-dimensional lattices of physical relevance. The method offers straightforward generalizations to several other counting problems.

In layperson’s terms, the work by Timme and company visits each node in a graph and records an expression that includes unknowns (futures?), that is, the values at other nodes in the graph. Using pattern matching techniques, the algorithm then solves for all of the unknowns and replaces them with appropriate values. How effective is this?

“The existing algorithm copies the whole network for each stage of the calculation and only changes one aspect of it each time,” explains Frank van Bussel of the Max Planck Institute for Dynamics and Self-Organization (MPIDS). Increasing the number of nodes dramatically increases the calculation time. For a square lattice the size of a chess board, this is estimated to be many billions of years. The new algorithm developed by the Göttingen-based scientists is significantly faster. “Our calculation for the chess board lattice only takes seven seconds,” explains Denny Fliegner from MPIDS. (A New Kind of Counting)

Hmmm, “many billions of years” versus “seven seconds.”

For further details on the pattern matching work see the project page at: Complex Disordered Systems: Statistical Physics and Symbolic Computation

Deferral of results looks like a fruitful area for research for topic maps in general and parallel processing of topic maps in particular.

I first saw the Multilisp paper in a tweet by Mom of Finland.

Hitchhiker’s Guide to Clojure

April 16th, 2014

Hitchhiker’s Guide to Clojure

From the webpage:

The following is a cautionary example of the unpredictable combination of Clojure, a marathon viewing of the BBC’s series “The Hitchhiker’s Guide to the Galaxy”, and a questionable amount of cheese.

There have been many tourism guides to the Clojure programming language. Some that easily come to mind for their intellectual erudition and prose are “The Joy of Touring Clojure”, “Touring Clojure”, “Clojure Touring”, and the newest edition of “Touring Clojure Touring”. However, none has surpassed the wild popularity of “The Hitchhiker’s Guide to Clojure”. It has sold over 500 million copies and has been on the “BigInt’s Board of Programming Language Tourism” for the past 15 years. While, arguably, it lacked the in-depth coverage of the other guides, it made up for it in useful practical tips, such as what to do if you find a nil in your pistachio. Most of all, the cover had the following words printed in very large letters: Don’t Worry About the Parens.

To tell the story of the book, it is best to tell the story of two people whose lives were affected by it: Amy Denn, one of the last remaining Pascal developers in Cincinnati, and Frank Pecan, a time traveler, guidebook researcher, and friend of Amy.

There isn’t any rule (that I’m aware of) that says computer texts must be written to be unfunny.

I think my only complaint is that the story is too short. ;-)

Enjoy!

Top secret MI5 files of First World War go online

April 16th, 2014

Top secret MI5 files of First World War go online

Well, that depends on your definition of “online.”

I was hopeful this collection would demonstrate the lack of need for long-term secrecy for “top secret” files, and that files are kept “secret” more to guard the interests of security staff than for any legitimate reason.

Then I selected the record for:

‘Mata Hari’ alias MCCLEOD Margaretha Geertruida (Marguerite Gertrude): executed by the…

I found that I can view it for free, at The National Archives.

I’m not sure that is “online” in any important sense of the word.

You?

WordNet RDF

April 16th, 2014

WordNet RDF

From the webpage:

This is the RDF version of WordNet, created by mapping the existing WordNet data into RDF. The data is structured according to the lemon model. In addition, links have been added from the following sources:

These links increase the usefulness of the WordNet data. If you would like to contribute extra linking to WordNet please Contact us.

Curious if you find it easier to integrate WordNet RDF with other data or the more traditional WordNet?
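For comparison, the “traditional” route is the NLTK corpus reader (standard NLTK calls, unrelated to the RDF release):

    # Requires the WordNet corpus: nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("graph"):
        print(synset.name(), "-", synset.definition())
        print("  hypernyms:", [h.name() for h in synset.hypernyms()])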

I first saw this in a tweet by Bob DuCharme.